ABSTRACT


This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system
dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault
injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to
dependability evaluation in the three phases of a system's life: the design phase, the prototype phase, and the operational phase.
Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a 
classification of research methods or study topics is outlined, followed by the discussion of these methods or topics as 
well as representative studies. 

The statistical techniques introduced include the estimation of parameters and confidence intervals, probability
distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique
used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers
electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments
such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation
fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE.
The discussion of measurement-based analysis covers measurement and data processing techniques, basic error
characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The
discussion involves several important issues studied in the area, including fault models, fast simulation techniques,
workload/failure dependency, correlated failures, and software fault tolerance.

I. INTRODUCTION


In computer science more than in physical sciences, the experimenter must decide what to consider and what to 
ignore in data gathering and analysis, sometimes without the benefit of prior information or easily available intuition. 
How to obtain general models from experiments or measurements made in a particular environment is by no means 
clear. This paper discusses the current research in the area of experimental analysis of computer system dependability. 
The discussion centers around methodologies, major developments, and major directions of the research in the area. 

Experimental evaluation of the dependability of a system can be performed at different phases of the system's 
life. In the early design phase, CAD (Computer-Aided Design) environments are used to evaluate a design via simu- 
lations, including simulated fault injections. Fault injection simulations can be used to investigate the effectiveness of 
fault tolerant mechanisms, to evaluate system dependability, and to provide timely feedback to system designers. 
However, simulations need accurate input parameters and the validation of output results. These should be estimated 
based on previous measurement-based analysis. In the prototype phase, a system runs under controlled workload con- 
ditions, and controlled fault injections are used to evaluate the system behavior under faults. Fault injections on real 
systems can provide information about the process from fault occurrence to system recovery, including error latency, 
propagation, detection, and recovery (reconfiguration may be involved). But fault injection can only study artificial 
faults, and it cannot provide some dependability measures such as MTBF (Mean Time Between Failures) and avail- 
ability. In the operational phase, a direct measurement-based approach can be used to measure systems in the field 
under real workloads. The collected data contain a large amount of information about naturally occurring 
errors/failures. Analysis of such data can provide understanding of actual error/failure characteristics and insight into 
analytical models. Although measurement-based analysis is useful for evaluating real systems, it is limited to detected 
errors. Further, conditions in the field can vary widely from one system to another, casting doubt on the statistical 
validity of the results. Thus, all three approaches are complementary and essential for accurate dependability analysis. 

In the design phase, fault injection simulations can be conducted at different levels: the electrical level, the logic
level, and the function level. The objectives of simulated fault injection are to determine dependability bottlenecks,
the coverage of error detection/recovery mechanisms, the effectiveness of reconfiguration schemes, the system TTF
(Time To Failure) distributions, reliability, availability, performance loss, and other dependability measures. The
resulting feedback from simulations can be extremely useful in cost-effective redesign of a system. In this paper, we
discuss different techniques used in fault injection simulations. We also introduce simulation tools at different levels.

In the prototype phase, while the objectives of physical fault injections are similar to those of simulated fault 
injections in the design phase, the methods are radically different because real fault injection and monitoring facilities 
are involved. Physical fault injections can be conducted at the hardware level (logic or electrical) or at the software 
level (code or data corruption). Further, heavy-ion radiation techniques can also be used to inject faults and stress a 
system. The instrumentation used in fault injection experiments is illustrated using real examples, including several
fault injection environments.

In the operational phase, the measurement-based approach needs to address issues such as how to monitor com- 
puter errors and failures and how to analyze measured data to quantify system dependability characteristics. Although 
there is extensive research on methods for the design and evaluation of fault tolerant systems, little is known about 
how well these strategies work in the field. A study of production systems is valuable not only for accurate evaluation 
but also for identifying reliability bottlenecks in system design. Several issues in measurement-based analysis, includ- 
ing workload/failure dependency, modeling and evaluation based on data, software dependability in the operational 
phase, and fault diagnosis are addressed. 

Results of measurement-based analysis discussed in this paper are based on over 100 machine-years of data 
gathered from IBM, DEC, and Tandem systems. The evaluation methodology discussed includes: the use of the mea- 
sured hardware and software error data to jointly characterize the interdependence between performance and depend- 
ability, correlation analysis to quantify correlated failures and their impact on dependability, Markov reward modeling 
of measured data to evaluate the loss of system service due to errors and failures, and algorithms that use on-line error 
logs to perform automatic fault diagnosis and failure prediction. 

Before discussing methodologies and developments for each of the three phases discussed above, we present an
overview of the relevant statistical techniques used in this area. The techniques cover the estimation of parameters
and confidence intervals, distribution characterization including function fitting, and multivariate analysis methods
including clustering analysis, correlation analysis, and factor analysis. Importance sampling, a statistical technique to
accelerate Monte Carlo simulation, is also introduced. These techniques are later used in the discussion of analysis of
data obtained from fault injections or measured from operational systems.


In discussing the experimental analysis approaches used in the three phases, a wide range of dependability
issues, including error latency, error propagation, error detection, error recovery, error correlation, workload/error
dependency, availability, reliability, performability, and reward rate, are addressed. In addition to presenting
methodologies and major developments in each of these phases, we also critique the relative merits and research issues for
different approaches. Most evaluation techniques introduced are illustrated via case studies of their uses on actual
systems.

II. STATISTICAL TECHNIQUES USED IN THE AREA

In this section, we introduce several statistical techniques commonly used in simulation and in the analysis of data
collected from fault injections and operational systems. The techniques discussed are not intended to be
comprehensive. For a comprehensive study of statistical methods, the reader is referred to advanced statistics
texts [Kendall77], [Dillon84]. In particular, we will discuss parameter estimation, distribution characterization,
and multivariate analysis techniques. Most of these techniques are widely used in every phase of the experimental
evaluation of dependability.

2.1. Parameter Estimation 

The most important characteristics of a random variable are its distribution, mean, and variance. In practice, 
means and variances are usually unknown parameters. Thus, how to estimate these unknown parameters from data 
needs to be addressed. 

2.1.1. Point Estimation 

Point estimation is often used in experimental analysis, such as the estimation of the detection coverage from 
fault injections and the estimation of MTBF (mean time between failures) from field data. Each fault injection and 
each failure occurrence can be treated as an experiment. The following theory is based on the assumption that all 
experiments are independent and have the same underlying distribution. 

Given a collection of n experimental outcomes $x_1, x_2, \ldots, x_n$ of a random variable X, each $x_i$ can be considered
as a value of a random variable $X_i$. These $X_i$'s are independent of each other and identical to X in distribution. The
set $(X_1, X_2, \ldots, X_n)$ is called a random sample of X. Our purpose is to estimate the value of some parameter $\theta$ ($\theta$
could be E[X] or Var[X]) of X using a function of $X_1, X_2, \ldots, X_n$. The function used to estimate $\theta$,
$\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$, is called an estimator of $\theta$, and $\hat{\theta}(x_1, x_2, \ldots, x_n)$ is said to be a point estimate of $\theta$.

An estimator $\hat{\theta}$ is called an unbiased estimator of $\theta$ if $E[\hat{\theta}] = \theta$. The unbiased estimator that has the minimum
variance, i.e., that minimizes $Var(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$ among all $\hat{\theta}$'s, is said to be the unbiased minimum variance estimator.

It can be shown that the sample mean

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \qquad (2.1)$$

is the unbiased minimum variance linear estimator of the population mean $\mu$, and the sample variance

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \qquad (2.2)$$

is, under some mild conditions, an unbiased minimum variance quadratic estimator of the population variance $\sigma^2$. If
an estimator $\hat{\theta}$ converges in probability to $\theta$, i.e.,

$$\lim_{n \to \infty} P(|\hat{\theta}(X_1, X_2, \ldots, X_n) - \theta| > \epsilon) = 0, \qquad (2.3)$$

where $\epsilon$ is any small positive number, it is said to be consistent.

A. Method of Maximum-Likelihood 

If the functional form of the p.d.f. of the variable is known, the maximum likelihood method is a good approach to
parameter estimation. In many cases, approximate functional forms of empirical distributions can be obtained (to be
discussed in Section 2.2). For example, the software TTE (Time To Error) in two measured distributed operating
systems was shown to have a hyperexponential distribution (see Section 5.3). In such cases, the maximum likelihood
method can be used to determine distribution parameters.

The idea of the maximum likelihood method is to choose an estimator based on the assumption that the
observed sample is the most likely to occur among all possible samples. The method usually produces estimators that
have minimum variance and consistency properties. But if the sample size is small, the estimator may be biased.

Assuming X has a p.d.f. (probability distribution function) $f(x|\theta)$, where $\theta$ is an unknown parameter, the joint
p.d.f. of the sample $\{X_1, X_2, \ldots, X_n\}$,

$$L(\theta) = \prod_{i=1}^{n} f(x_i|\theta) \qquad (2.4)$$

is called the likelihood function of $\theta$. If $\hat{\theta}(x_1, x_2, \ldots, x_n)$ is the point estimate of $\theta$ that maximizes $L(\theta)$, then
$\hat{\theta}(X_1, X_2, \ldots, X_n)$ is said to be the maximum likelihood estimator of $\theta$.

Now we use an example to illustrate the method. Let X denote the random variable "time between failures" in a
computer system. Assuming X is exponentially distributed with an arrival rate $\lambda$, we wish to estimate $\lambda$ from a random
sample $\{X_1, X_2, \ldots, X_n\}$. By Equation (2.4),

$$L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_{i=1}^{n} x_i} .$$

How do we choose an estimator such that the estimated $\lambda$ maximizes $L(\lambda)$? An easier way is to find the $\lambda$ value
that maximizes $\ln L(\lambda)$, instead of $L(\lambda)$. This is because the $\lambda$ that maximizes $L(\lambda)$ also maximizes $\ln L(\lambda)$, and $\ln L(\lambda)$
is easier to handle. In this case we have

$$\ln L(\lambda) = n \ln(\lambda) - \lambda \sum_{i=1}^{n} x_i .$$

To find the maximum, consider the first derivative

$$\frac{d[\ln L(\lambda)]}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i .$$

Setting this derivative to zero and solving for $\lambda$, we find that

$$\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} X_i}$$

is the maximum likelihood estimator for $\lambda$.
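
The example can be checked numerically. The following is a minimal sketch, not part of the original study: the function names and the failure-time data are hypothetical. It computes n divided by the sum of the observations and confirms that the log-likelihood is lower on either side of that value.

```python
import math

def exponential_rate_mle(times):
    """Maximum likelihood estimate of the rate: n / sum of observations."""
    if not times or any(t <= 0 for t in times):
        raise ValueError("need a non-empty sample of positive times")
    return len(times) / sum(times)

def log_likelihood(rate, times):
    """ln L(rate) = n ln(rate) - rate * sum(x_i) for an exponential sample."""
    return len(times) * math.log(rate) - rate * sum(times)

# Hypothetical times between failures, in hours (illustration only).
tbf_hours = [12.0, 30.5, 7.25, 50.0, 20.25]
rate_hat = exponential_rate_mle(tbf_hours)

# The log-likelihood at the estimate should not be exceeded nearby.
print(rate_hat,
      log_likelihood(rate_hat, tbf_hours) >= log_likelihood(0.9 * rate_hat, tbf_hours),
      log_likelihood(rate_hat, tbf_hours) >= log_likelihood(1.1 * rate_hat, tbf_hours))
```

The reciprocal of the estimated rate is the sample mean, which is the corresponding estimate of the mean time between failures under the exponential assumption.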


B. Method of Moments 

Sometimes it is impossible to find maximum likelihood estimators in closed form. For instance, it is difficult to
maximize the likelihood based on the following p.d.f. of the gamma distribution $G(\alpha, \theta)$

$$f(x) = \frac{x^{\alpha-1} e^{-x/\theta}}{\Gamma(\alpha)\, \theta^{\alpha}}, \qquad x > 0$$

in estimating $\alpha$ and $\theta$, because of the existence of the gamma function $\Gamma(\alpha)$. The gamma distribution is often found
useful for characterizing interval times in the real world. It will be shown in Section 5.3 that the software TTE in a
measured single-machine operating system fits a multi-stage gamma distribution. In such cases, the method of
moments can be used if an analytical relationship between the moments of the variable and the parameters to estimate
can be found.

To introduce the method of moments, we first bring out the concepts of sample moment and population
moment. The k-th (k = 1, 2, ...) sample moment of the random variable X is defined as

$$m_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k \qquad (2.5)$$

where $X_1, X_2, \ldots, X_n$ are a sample of X. The k-th population moment of X is just $E[X^k]$.

Suppose there are k parameters to be estimated. The idea of the method of moments is to set the first k sample 
moments equal to the first k population moments which are expressed as the unknown parameters, and then to solve 
these k equations for the unknown parameters. The method usually gives simple and consistent estimators. However, 
some estimators may not have unbiased and minimum variance properties. The following example shows details of 
the method. 

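Consider the above gamma distribution example. We wish to estimate $\alpha$ and $\theta$, based on a sample
$\{X_1, X_2, \ldots, X_n\}$ from a gamma distribution. Since $X \sim G(\alpha, \theta)$, we know

$$E[X] = \alpha\theta, \qquad E[X^2] = \alpha\theta^2 + \alpha^2\theta^2 .$$

The first two sample moments, by definition, are given by

$$m_1 = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}, \qquad m_2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 = \frac{n-1}{n} S^2 + \bar{X}^2 .$$

Setting $m_1 = E[X]$ and $m_2 = E[X^2]$ and solving for $\alpha$ and $\theta$, we obtain

$$\hat{\alpha} = \frac{m_1^2}{m_2 - m_1^2}, \qquad \hat{\theta} = \frac{m_2 - m_1^2}{m_1} .$$

For large n, these are approximately $\bar{X}^2 / S^2$ and $S^2 / \bar{X}$, respectively.
These are the estimators for $\alpha$ and $\theta$ from the method of moments.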
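
As a rough illustration of this example, the sketch below equates the first two sample moments to $\alpha\theta$ and $\alpha\theta^2 + \alpha^2\theta^2$ and solves for the two parameters. The function name and the sample values are hypothetical, and the scale parameterization (mean equal to $\alpha\theta$) is assumed from the moment equations above.

```python
def gamma_moment_estimates(sample):
    """Method-of-moments estimates (alpha, theta) for a gamma sample.

    Solves m1 = alpha*theta and m2 = alpha*theta**2 + (alpha*theta)**2,
    where m1 and m2 are the first two sample moments.
    """
    n = len(sample)
    m1 = sum(sample) / n
    m2 = sum(x * x for x in sample) / n
    spread = m2 - m1 * m1          # equals alpha * theta**2 under the model
    if m1 <= 0 or spread <= 0:
        raise ValueError("sample must be positive and not constant")
    theta_hat = spread / m1
    alpha_hat = m1 * m1 / spread
    return alpha_hat, theta_hat

# Hypothetical interval times (illustration only).
alpha_hat, theta_hat = gamma_moment_estimates([1.0, 2.0, 3.0, 4.0, 5.0])
print(alpha_hat, theta_hat)
```

By construction, the product of the two estimates reproduces the sample mean, which is a quick sanity check on any implementation.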


2.1.2. Interval Estimation 

So far our discussion has been limited to the point estimation of unknown parameters. The estimate may deviate from
the actual parameter value. To obtain an estimate with a high confidence, it is necessary to construct an interval
estimate such that the interval includes the actual parameter value with a high probability. Given an estimator $\hat{\theta}$, if

$$P(\hat{\theta} - e_1 < \theta < \hat{\theta} + e_2) = \beta , \qquad (2.6)$$

the random interval $(\hat{\theta} - e_1, \hat{\theta} + e_2)$ is said to be a $100 \times \beta\%$ confidence interval for $\theta$, and $\beta$ is called the confidence
coefficient (the probability that the confidence interval contains $\theta$).

A. Confidence Intervals for Means

In the following discussion, the sample mean $\bar{X}$ will be used as the estimator for the population mean. As
mentioned before, it is the unbiased minimum variance linear estimator for $\mu$. We first consider the case in which the
sample size is large. By the central limit theorem, $\bar{X}$ is asymptotically normally distributed, no matter what the population
distribution is. Thus, when the sample size n is reasonably large (usually 30 or above, sometimes at least 50 if the
population distribution is badly skewed with occasional outliers), $Z = (\bar{X} - \mu)/(S/\sqrt{n})$ can be approximately treated as
a standard normal variable. To obtain a $100\beta\%$ confidence interval for $\mu$, we can find a number $z_{\alpha/2}$ from the N(0, 1)
distribution table such that $P(Z > z_{\alpha/2}) = \alpha/2$, where $\alpha = 1 - \beta$. Then we have

$$P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{S/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha .$$

Thus, the $100(1 - \alpha)\%$ confidence interval for $\mu$ is approximately

$$\bar{X} - z_{\alpha/2} \frac{S}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2} \frac{S}{\sqrt{n}} . \qquad (2.7)$$
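
The large-sample interval of Equation (2.7) is straightforward to compute. The following sketch uses only the Python standard library; the function name and the data are hypothetical, and the normal quantile is obtained from `statistics.NormalDist` rather than from a printed table.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, alpha=0.10):
    """Approximate 100(1 - alpha)% interval for the mean, large-sample case."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)                          # sample standard deviation (n - 1 divisor)
    z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{alpha/2}
    half_width = z * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical sample of 100 observations (illustration only).
observations = [float(v) for v in range(1, 101)]
lower, upper = mean_confidence_interval(observations, alpha=0.10)
print(lower, upper)
```

Because the interval relies on the central limit theorem, the sketch should only be trusted for sample sizes in the range the text recommends.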


If the sample size is small (considerably smaller than 30), the above approximation can be poor. In this case, we
consider two commonly used distributions: normal and exponential. If the population distribution is normal, the
random variable $T = (\bar{X} - \mu)/(S/\sqrt{n})$ has a Student t distribution with n - 1 degrees of freedom. By repeating the same
approach performed above with a t distribution table, the following $100(1 - \alpha)\%$ confidence interval for $\mu$ can be
obtained:

$$\bar{X} - t_{n-1;\alpha/2} \frac{S}{\sqrt{n}} < \mu < \bar{X} + t_{n-1;\alpha/2} \frac{S}{\sqrt{n}} , \qquad (2.8)$$

where $t_{n-1;\alpha/2}$ is a number such that $P(T > t_{n-1;\alpha/2}) = \alpha/2$. Theoretically, Equation (2.8) requires that X have a normal
distribution. However, we will show later that the estimator is not very sensitive to the distribution of X when the
sample size is reasonably large (15 or more).


If the population distribution is exponential, it can be shown that $\chi^2 = 2n\bar{X}/\mu$ has a chi-square distribution with
2n degrees of freedom. Thus, the chi-square distribution table should be used. Because the chi-square distribution is
not symmetrical about the origin, we need to find two numbers, $\chi^2_{2n;1-\alpha/2}$ and $\chi^2_{2n;\alpha/2}$, such that
$P(\chi^2 < \chi^2_{2n;1-\alpha/2}) = \alpha/2$ and $P(\chi^2 > \chi^2_{2n;\alpha/2}) = \alpha/2$. The obtained $100(1 - \alpha)\%$ confidence interval for $\mu$ is

$$\frac{2n\bar{X}}{\chi^2_{2n;\alpha/2}} < \mu < \frac{2n\bar{X}}{\chi^2_{2n;1-\alpha/2}} . \qquad (2.9)$$
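
As a hedged sketch of the exponential case, the code below computes the interval of Equation (2.9) without a chi-square table. It relies on a standard identity that is not stated in the text: for an even number of degrees of freedom 2n, the upper tail of the chi-square distribution is a finite Poisson sum. The helper names and the data are hypothetical, and the simple summation is only intended for modest sample sizes.

```python
import math

def chi2_upper_tail(x, n):
    """P(chi-square with 2n degrees of freedom > x), for integer n >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = math.exp(-half)          # k = 0 term of the Poisson sum
    total = term
    for k in range(1, n):
        term *= half / k
        total += term
    return total

def chi2_point(tail_prob, n):
    """Value c with P(chi-square_2n > c) = tail_prob, found by bisection."""
    low, high = 0.0, 1.0
    while chi2_upper_tail(high, n) > tail_prob:
        high *= 2.0
    for _ in range(100):
        mid = (low + high) / 2.0
        if chi2_upper_tail(mid, n) > tail_prob:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

def exponential_mean_ci(sample, alpha=0.10):
    """100(1 - alpha)% interval for the mean of an exponential sample."""
    n = len(sample)
    two_n_xbar = 2.0 * sum(sample)
    return (two_n_xbar / chi2_point(alpha / 2.0, n),
            two_n_xbar / chi2_point(1.0 - alpha / 2.0, n))

# Hypothetical times between failures, in hours (illustration only).
ci_low, ci_high = exponential_mean_ci([12.0, 30.5, 7.25, 50.0, 20.25])
print(ci_low, ci_high)
```

The resulting interval is not symmetric about the sample mean, which reflects the asymmetry of the chi-square distribution noted above.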


B. Confidence Intervals for Variances 

The estimation of confidence intervals for variances is more complicated than that for means, because the sample
variance cannot be simply approximated by a unique distribution (such as the normal distribution) regardless of the
population distribution. However, irrespective of the population distribution, $\lim_{n \to \infty} Var[S^2] = 0$. Thus, a good confidence
interval can be expected as long as the sample size n is large. Next, our discussion will be focused on the two
commonly used distributions: normal and exponential.

If X is normally distributed, the sample variance $S^2$ can be used to construct the confidence interval. It is known
that the random variable $(n-1)S^2/\sigma^2$ has a chi-square distribution with n - 1 degrees of freedom. To determine a
$100(1 - \alpha)\%$ confidence interval for $\sigma^2$, we follow the procedure for constructing Equation (2.9) to find the numbers
$\chi^2_{n-1;1-\alpha/2}$ and $\chi^2_{n-1;\alpha/2}$ from the chi-square distribution table. The confidence interval is then given by

$$\frac{(n-1)S^2}{\chi^2_{n-1;\alpha/2}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{n-1;1-\alpha/2}} . \qquad (2.10)$$

Similar to Equation (2.8), our experience shows that this equation is not restricted to the normal distribution when the
sample size is reasonably large (15 or more).

If X is exponentially distributed, Equation (2.9) can be used to estimate the confidence interval for $\sigma^2$, because
for the exponential random variable, $\sigma^2$ equals $\mu^2$. Since all terms in Equation (2.9) are positive, we can square
them. The result gives a $100(1 - \alpha)\%$ confidence interval for $\sigma^2$:

$$\left(\frac{2n\bar{X}}{\chi^2_{2n;\alpha/2}}\right)^2 < \sigma^2 < \left(\frac{2n\bar{X}}{\chi^2_{2n;1-\alpha/2}}\right)^2 . \qquad (2.11)$$


C. Confidence Intervals for Proportions 

Often, we need to estimate the confidence interval for a proportion or percentage whose underlying distribution
is unknown. For example, we may want to estimate the confidence interval for the detection coverage after fault
injection experiments. In general, given n Bernoulli trials with the probability of success on each trial being p and the
number of successes being Y, how do we find a confidence interval for p? If n is large (particularly when np > 5 and
n(1 - p) > 5 [Hogg83]), Y/n has an approximately normal distribution, $N(\mu, \sigma^2)$, with $\mu = p$ and $\sigma^2 = p(1 - p)/n$.
Note that Y/n is the sample mean, which is an estimate of $\mu$ or p. By Eq. (2.7), the $100(1 - \alpha)\%$ confidence interval for
p is

$$\frac{Y}{n} \pm z_{\alpha/2} \sqrt{p(1 - p)/n} . \qquad (2.12)$$


This equation can be used to determine the number of injections required to achieve a given confidence interval
for an estimated fault detection coverage. Let n represent the number of fault injections and Y the number of faults
detected in the n injections. Assume that all faults have the same detection coverage, which is approximately p. Now
we wish to estimate p with the half-width of the $100(1 - \alpha)\%$ confidence interval being e. By Eq. (2.12), we have

$$e = z_{\alpha/2} \sqrt{p(1 - p)/n} . \qquad (2.13)$$

Solving the equation for n:

$$n = \frac{z_{\alpha/2}^2 \, p(1 - p)}{e^2} , \qquad (2.14)$$

where n is the number of injections required to achieve the desired confidence interval in estimating p.

For example, assume detection coverage p = 0.6, confidence interval e = 0.05, and confidence coefficient
$1 - \alpha = 90\%$. Then the required number of injections is

$$n = \frac{1.645^2 \times 0.6 \times 0.4}{0.05^2} \approx 260 .$$
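
The worked example can be reproduced with a few lines of code. This is a minimal sketch with a hypothetical function name; it obtains $z_{\alpha/2}$ from the standard library instead of a table and rounds the result of Equation (2.14) up to a whole number of injections.

```python
import math
from statistics import NormalDist

def required_injections(coverage, half_width, alpha):
    """Number of fault injections n from Eq. (2.14), rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{alpha/2}
    return math.ceil(z * z * coverage * (1 - coverage) / half_width ** 2)

# Values from the example in the text: p = 0.6, e = 0.05, 1 - alpha = 90%.
print(required_injections(0.6, 0.05, 0.10))   # prints 260
```

Since p(1 - p) is largest at p = 0.5, evaluating the function at a coverage of 0.5 gives a conservative upper bound when no prior estimate of the coverage is available.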


2.2. Distribution Characterization 

Mean and variance are important parameters that summarize data by single numbers. A probability distribution
provides more information about data. Analysis of distributions can help one understand data in detail as well as the
underlying models. For example, if the waiting times in all states of a transition model are exponential, then the model
is a Markov model. Otherwise, it is a semi-Markov model. We will discuss empirical distribution functions and
function fitting in this subsection.

2.2.1. Empirical Distribution


Given a sample of X, the simplest way to obtain an empirical distribution of X is to plot a histogram of the
observations. The range of the sample space is divided into a number of subranges called buckets. The lengths of the
buckets do not have to be the same. Assume that we have k buckets, separated by $x_0, x_1, \ldots, x_k$, for the given sample
with the size of n. In each bucket, there are $y_i$ instances. Obviously, the sample size n is $\sum_{i=1}^{k} y_i$. Then, $y_i/n$ is an
estimation of the probability that X takes a value in bucket i. We will call the histogram an empirical probability
distribution function (p.d.f.) of X. It is easy to construct the following empirical cumulative distribution function (c.d.f.)
from the histogram.

$$F_k(x) = \begin{cases} 0, & x < x_0 \\ \sum_{j=1}^{i} \dfrac{y_j}{n}, & x_{i-1} \le x < x_i \\ 1, & x_k \le x \end{cases} \qquad (2.15)$$


The key problem in plotting histograms is determining the bucket size. A small size may lead to a large
variation among buckets so that the shape of the distribution cannot be identified. A large size may lose details of
the distribution. Given a data set, it is possible to obtain very different distribution shapes by using different bucket
sizes. One guideline is that if any bucket has fewer than five instances, the bucket size should be increased or a variable
bucket size should be used. In our experience, 10 to 50 buckets are appropriate in most cases, depending on the
sample size. We will call the histogram constructed from data the empirical distribution.


2.2.2. Function Fitting 

Analytical distribution functions are useful in analytical modeling and simulations. Thus, it is often desirable to
fit an analytical function to a given empirical distribution. Function fitting is not a trivial task and relies on certain
knowledge of statistical distribution functions. The procedure given in the following is based on our experience.
Given an empirical distribution, the first step is to make a good guess of the closest distribution function(s) by
observing the shape of the empirical distribution. The second step is to use a statistical package such as SAS to obtain the
parameters for a guessed function by trying to fit it to the empirical distribution. The third step is to perform a
significance test of the goodness-of-fit to see if the fitted function is acceptable. If the function is not acceptable, we have to
go back to step 2 to try a different function.


Now we discuss the significance test in step 3. Assume that the given empirical c.d.f. is $F_k(x)$, defined in Eq. (2.15),
and the hypothesized c.d.f. is $F(x)$ (obtained from step 2 above). Our task is to test the hypothesis

$$H_0: F_k(x) = F(x).$$

There are two commonly used goodness-of-fit test methods: the chi-square test and the Kolmogorov-Smirnov 
test. We now briefly introduce the two methods. 

A. Chi-Square Test 

The chi-square test assumes the distribution under consideration can be approximated by a multinomial
distribution, which usually holds. Let

$$p_i = F(x_i) - F(x_{i-1}) , \qquad i = 1, \ldots, k ,$$

where $p_i$ is the probability that an instance falls into bucket i. If we define

$$P\{x_{i-1} \le X_i < x_i\} = p_i , \qquad i = 1, \ldots, k ,$$

then $X_1, X_2, \ldots, X_k$ have a multinomial distribution which is equivalent to the original distribution F(x). Thus, for a
sample size of n, the expected number of instances falling into bucket i is $np_i$, by the above distribution. The sum of error
squares divided by the expected numbers

$$q_{k-1} = \sum_{i=1}^{k} \frac{(y_i - np_i)^2}{np_i} \qquad (2.16)$$

is a measure of the "closeness" of the observed number of instances, $y_i$, to the expected number of instances, $np_i$, in
bucket i. If $q_{k-1}$ is small, we tend to accept $H_0$. The "smallness" can be measured in terms of statistical significance if
we treat $q_{k-1}$ as a particular value of the random variable $Q_{k-1}$. It can be shown that if n is large ($np_i > 1$), $Q_{k-1}$ has an
approximate chi-square distribution with k - 1 degrees of freedom, $\chi^2(k - 1)$. If $H_0$ is true, we expect that $q_{k-1}$ falls
into an acceptable range of $Q_{k-1}$ so that the event is likely to occur. The boundary value, or critical value, of the
acceptable range, $\chi^2_{\alpha}(k - 1)$, is chosen such that

$$P[Q_{k-1} > \chi^2_{\alpha}(k - 1)] = \alpha ,$$

where $\alpha$ is called the significance level of the test. Thus, we should reject $H_0$ if $q_{k-1} > \chi^2_{\alpha}(k - 1)$. Usually, $\alpha$ is
chosen to be 0.05 or 0.1.
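
To make the test concrete, the sketch below computes the statistic of Equation (2.16) from bucket counts, bucket boundaries, and a hypothesized c.d.f. All names and data are hypothetical: the counts are invented, the hypothesized distribution is an exponential with rate 0.5, and the critical value 7.815 is the tabulated $\chi^2_{0.05}(3)$ for the four buckets used here.

```python
import math

def chi_square_statistic(counts, edges, cdf):
    """q_{k-1} of Eq. (2.16): sum of (y_i - n p_i)^2 / (n p_i) over the buckets."""
    n = sum(counts)
    q = 0.0
    for i, y in enumerate(counts):
        p_i = cdf(edges[i + 1]) - cdf(edges[i])   # probability of bucket i
        expected = n * p_i
        q += (y - expected) ** 2 / expected
    return q

def exponential_cdf(rate):
    """Hypothesized c.d.f. F(x) = 1 - exp(-rate * x) for x > 0."""
    return lambda x: 1.0 - math.exp(-rate * x) if x > 0 else 0.0

# Hypothetical histogram: 4 buckets separated by 0, 1, 2, 4, infinity.
edges = [0.0, 1.0, 2.0, 4.0, math.inf]
counts = [41, 22, 25, 12]
q = chi_square_statistic(counts, edges, exponential_cdf(0.5))

critical = 7.815          # chi-square table, alpha = 0.05, k - 1 = 3
print(q, "reject H0" if q > critical else "accept H0")
```

The last bucket is left open-ended so that the bucket probabilities sum to one, which the multinomial argument above requires.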
B. Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is a nonparametric method in that it assumes no particular distribution for the
variable in consideration. The method uses the empirical c.d.f., instead of the empirical p.d.f., to perform the test,
which is more stringent than the chi-square test. The Kolmogorov-Smirnov statistic is defined by

$$D_k = \sup_x \left[\, |F_k(x) - F(x)| \,\right] , \qquad (2.17)$$

where $\sup_x$ represents the least upper bound of all pointwise differences $|F_k(x) - F(x)|$. In calculation, we can choose
the midpoint between $x_{i-1}$ and $x_i$, for i = 1, ..., k, to obtain the maximum value of $|F_k(x_i) - F(x_i)|$. It is seen that $D_k$
is a measure of the closeness of the empirical and hypothesized distribution functions. It can be derived that $D_k$
follows a distribution whose c.d.f. values are given by the table of Kolmogorov-Smirnov Acceptance Limits [Hogg83].
Thus, given a significance level $\alpha$, we can find the critical value $d_k$ from the table such that

$$P[D_k > d_k] = \alpha .$$

The hypothesis $H_0$ is rejected if the calculated value of $D_k$ is greater than the critical value $d_k$. Otherwise, we accept
$H_0$.
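
The midpoint calculation described above can be sketched as follows. The function name and the data are hypothetical; the empirical c.d.f. is built from bucket counts as in Eq. (2.15), and the critical value is assumed to be looked up separately in a Kolmogorov-Smirnov table.

```python
def ks_statistic(counts, edges, cdf):
    """Largest |F_k(x) - F(x)| over the bucket midpoints.

    counts[i] is the number of instances in the bucket between
    edges[i] and edges[i + 1]; all edges must be finite.
    """
    n = sum(counts)
    cumulative = 0
    d_max = 0.0
    for i, y in enumerate(counts):
        cumulative += y                              # F_k on bucket i is cumulative / n
        midpoint = (edges[i] + edges[i + 1]) / 2.0
        d_max = max(d_max, abs(cumulative / n - cdf(midpoint)))
    return d_max

# Hypothetical data: 100 observations in four equal buckets on [0, 1],
# tested against a uniform c.d.f. F(x) = x.
d_k = ks_statistic([25, 25, 25, 25], [0.0, 0.25, 0.5, 0.75, 1.0], lambda x: x)
print(d_k)
```

Evaluating only at midpoints is the shortcut suggested in the text; with raw (unbucketed) observations, the supremum is usually taken over the jump points of the empirical c.d.f. instead.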


2.3. Multivariate Analysis

In reality, measurements are usually made on more than one variable. For example, a computer workload
measurement may include usages on CPU, memory, disk, and network. A computer failure measurement may collect data
on multiple components. Multivariate analysis is the application of methods that deal with multiple variables. These
methods, including clustering analysis, correlation analysis, and factor analysis to be discussed, identify and quantify
simultaneous relationships among multiple variables.


2.3.1. Clustering Analysis 

Clustering analysis is useful for characterizing workload states in computer systems by clustering similar points
in resource usage. Assume we have a sample of p variables with a size of n. We call each instance in the sample a
point, which consists of p values. Clustering analysis identifies similar points and clusters them into groups (clusters).
Let $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ denote the ith point of the sample. The Euclidean distance between points i and j,

$$d_{ij} = |x_i - x_j| = \left( \sum_{l=1}^{p} (x_{il} - x_{jl})^2 \right)^{1/2} ,$$

is usually used as a similarity measure between points i and j.

There are several different clustering algorithms. The goal of these algorithms is to achieve small within-cluster
variation relative to the between-cluster variation. A commonly used algorithm is the k-means clustering algorithm.
The algorithm partitions a sample with p dimensions and n points into k clusters, $C_1, C_2, \ldots, C_k$. Denote the mean, or
centroid, of $C_j$ by $\bar{x}_j$. The error component of the partition is defined as

$$E = \sum_{j=1}^{k} \sum_{x_i \in C_j} |x_i - \bar{x}_j|^2 . \qquad (2.18)$$

The goal of the k-means algorithm is to find a partition that minimizes E.

The clustering procedure is as follows: Start with k groups, each of which consists of a single point. Each new
object is added to the group with the closest centroid. After a point is added to a group, the mean of that group is
adjusted to take the new point into account. After a partition is formed, search for another partition with smaller E by
moving points from one cluster to another cluster until no transfer of a point results in a reduction in E.
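
The procedure just described can be sketched in a few functions. This is an illustrative implementation under stated assumptions, not the algorithm of any particular package: the first k points are used as the initial single-point groups (the text does not say how they are chosen), a cluster is never emptied, and E is recomputed from scratch for every trial move, which is simple but slow for large samples. All names and data are hypothetical.

```python
def centroid(members):
    """Mean point of a non-empty list of equal-length tuples."""
    dims = len(members[0])
    return tuple(sum(pt[d] for pt in members) / len(members) for d in range(dims))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def partition_error(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in clusters:
        c = centroid(members)
        total += sum(sq_dist(pt, c) for pt in members)
    return total

def k_means(points, k):
    # Start with k groups, each holding a single point (assumed: the first k).
    clusters = [[pt] for pt in points[:k]]
    # Add each remaining point to the group with the closest centroid;
    # centroids are recomputed from the members, so they track new points.
    for pt in points[k:]:
        nearest = min(range(k), key=lambda j: sq_dist(pt, centroid(clusters[j])))
        clusters[nearest].append(pt)
    # Move single points between clusters for as long as E decreases.
    improved = True
    while improved:
        improved = False
        for src in range(k):
            for pt in list(clusters[src]):
                if len(clusters[src]) == 1:
                    break                          # never empty a cluster
                current = partition_error(clusters)
                for dst in range(k):
                    if dst == src:
                        continue
                    clusters[src].remove(pt)
                    clusters[dst].append(pt)
                    if partition_error(clusters) < current - 1e-12:
                        improved = True
                        break                      # keep the move
                    clusters[dst].pop()            # undo the move
                    clusters[src].append(pt)
    return clusters

# Hypothetical two-dimensional resource-usage points (illustration only).
usage = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0), (11, 10)]
groups = k_means(usage, 2)
print(groups)
```

In this small example the sequential assignment initially places one low-usage point in the wrong group, and the transfer step then moves it, which illustrates why the final search for a smaller E is part of the procedure.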

2.3.2. Correlation Analysis 

Correlation analysis can be used to quantify error or workload dependency between two components in a
system. The correlation coefficient, $Cor(X_1, X_2)$, between the random variables $X_1$ and $X_2$ is defined as

$$Cor(X_1, X_2) = \frac{E[(X_1 - \mu_1)(X_2 - \mu_2)]}{\sigma_1 \sigma_2} , \qquad (2.19)$$

where $\mu_1$ and $\mu_2$ are the means of $X_1$ and $X_2$, and $\sigma_1$ and $\sigma_2$ are the standard deviations of $X_1$ and $X_2$, respectively. If we
use $\rho$ to denote the correlation coefficient, then $\rho$ satisfies $-1 \le \rho \le 1$. The correlation coefficient is a measure of the
linear relationship between two variables. When $|\rho| = 1$, we have $X_1 = aX_2 + b$, where a > 0 if $\rho = 1$, or a < 0 if $\rho = -1$.
In these extreme cases, there is an exact linear relationship between $X_1$ and $X_2$. When there is no exact linear
relationship between $X_1$ and $X_2$ ($|\rho| < 1$), $\rho$ measures the goodness of the linear relationship $X_1 = aX_2 + b$
between $X_1$ and $X_2$. Usually, a $\rho$ value of 0.5 or above is considered reasonably high.

Given random variables $X_1$, $X_2$, and $X_3$, and correlation coefficients between each pair, $\rho_{12}$, $\rho_{13}$, and $\rho_{23}$, we
know these variables are related to each other by $\rho_{12}$, $\rho_{13}$, and $\rho_{23}$. Since $X_1$ is related to $X_2$ and $X_2$ is related to $X_3$, a
partial dependence between $X_1$ and $X_3$ may be due to $X_2$. The partial correlation coefficient defined below quantifies
this partial dependence.

$$\rho_{13.2} = \frac{\rho_{13} - \rho_{12}\rho_{23}}{\sqrt{(1 - \rho_{12}^2)(1 - \rho_{23}^2)}} \qquad (2.20)$$

The partial correlation coefficient can be considered as a measure of the common relationship among the three variables.


If a random variable, $X$, is defined on a time series, the correlation coefficient can be used to quantify the time
serial dependence in the sample data of $X$. Given a time window $\Delta t > 0$, the autocorrelation coefficient of $X$ on the
time series $t$ is defined as

$$\mathrm{Autocor}(X, \Delta t) = \mathrm{Cor}(X(t), X(t + \Delta t)) \tag{2.21}$$

where $t$ is defined on the discrete values $(\Delta t, 2\Delta t, 3\Delta t, \ldots)$. In this case, we treat $X(t)$ and $X(t + \Delta t)$ as two different
random variables and the autocorrelation coefficient is actually the correlation coefficient between the two variables.
That is, $\mathrm{Autocor}(X, \Delta t)$ measures the time serial correlation of $X$ with a window $\Delta t$.
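The definition can be sketched in a few lines by pairing each observation with the one a fixed number of steps later. This is a minimal illustration under the assumption that the series is sampled at equal intervals; the function name `autocorrelation` and the integer `lag` (the window expressed in sampling steps) are hypothetical.

```python
def autocorrelation(series, lag):
    """Estimate Cor(X(t), X(t + lag steps)) from one equally spaced series."""
    if lag < 1 or lag >= len(series) - 1:
        raise ValueError("lag must satisfy 1 <= lag < len(series) - 1")
    head = series[:-lag]   # observations X(t)
    tail = series[lag:]    # observations X(t + lag steps)
    n = len(head)
    mean_h = sum(head) / n
    mean_t = sum(tail) / n
    cov = sum((h - mean_h) * (t - mean_t) for h, t in zip(head, tail))
    var_h = sum((h - mean_h) ** 2 for h in head)
    var_t = sum((t - mean_t) ** 2 for t in tail)
    return cov / (var_h * var_t) ** 0.5

# A strictly alternating (hypothetical) error indicator series.
alternating = [0, 1, 0, 1, 0, 1, 0, 1]
print(autocorrelation(alternating, 1), autocorrelation(alternating, 2))
```

For the alternating series the coefficient is negative at a lag of one step and positive at a lag of two, which is the kind of periodic time dependence the autocorrelation coefficient is meant to expose.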


2.3.3. Factor Analysis 

The limitation of correlation analysis is that the correlation coefficient can only quantify dependency between
two variables. However, dependency may exist within a group of more than two variables or even among all variables.
The correlation coefficient cannot provide information about this multiple dependency. Factor analysis is one of the
statistical techniques used to quantify multi-way dependency among variables. The method attempts to find a set of
unobserved common factors which link together the observed variables. Consequently, it provides insight into the
underlying structure of the data. For example, in a distributed system, a disk crash can account for failures on those machines
whose operations depend on a set of critical data on the disk. The disk state can be considered to be a common factor
for failures on these machines.

Let $X = (x_1, \ldots, x_p)$ be a normalized random vector. We say that the $k$-factor model holds for $X$ if $X$ can be
written in the form

$$X = \Lambda F + E \tag{2.22}$$

where $\Lambda = (\lambda_{ij})$ $(i = 1, \ldots, p;\ j = 1, \ldots, k)$ is a matrix of constants called factor loadings, and $F = (f_1, \ldots, f_k)$ and
$E = (e_1, \ldots, e_p)$ are random vectors. The elements of $F$ are called common factors, and the elements of $E$ are called
unique factors (error terms). These factors are unobservable variables. It is assumed that all factors (both common and
unique factors) are independent of each other and that the common factors are normalized.


Each variable $x_i$ $(i = 1, \ldots, p)$ can then be expressed as

$$x_i = \sum_{j=1}^{k} \lambda_{ij} f_j + e_i$$

and its variance can be written as

$$\sigma_i^2 = \sum_{j=1}^{k} \lambda_{ij}^2 + \psi_i$$

where $\psi_i$ is the variance of $e_i$. Thus, the variance of $x_i$ can be split into two parts. The first part,

$$\sum_{j=1}^{k} \lambda_{ij}^2 ,$$

is called the communality and represents the variance of $x_i$ which is shared with the other variables via the common
factors. In particular, $\lambda_{ij} = \mathrm{Cor}(x_i, f_j)$ represents the extent to which $x_i$ depends on the $j$th common factor. The
second part, $\psi_i$, is called the unique variance and is due to the unique factor $e_i$; it explains the variability in $x_i$ not shared
with the other variables.
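The variance decomposition can be checked numerically. The sketch below generates data from a two-factor model and compares the sample variance of each variable with the sum of its communality and unique variance. Everything in it is an assumption made for illustration: the loading matrix is invented, the factors are taken to be Gaussian, and the unique variances are chosen so that each variable has unit variance (the normalized case).

```python
import random

random.seed(1)

# Hypothetical loadings for p = 3 variables and k = 2 common factors.
LOADINGS = [[0.8, 0.3], [0.6, 0.5], [0.2, 0.7]]
communality = [sum(lam * lam for lam in row) for row in LOADINGS]
unique_var = [1.0 - c for c in communality]  # normalized x_i: total variance 1

def sample_x():
    """Draw one observation x = Lambda f + e (Gaussian factors assumed)."""
    f = [random.gauss(0.0, 1.0) for _ in range(2)]
    return [sum(lam * fj for lam, fj in zip(row, f)) + random.gauss(0.0, psi ** 0.5)
            for row, psi in zip(LOADINGS, unique_var)]

n = 20000
data = [sample_x() for _ in range(n)]
sample_var = []
for i in range(3):
    mean = sum(x[i] for x in data) / n
    sample_var.append(sum((x[i] - mean) ** 2 for x in data) / n)

for i in range(3):
    print(f"x{i + 1}: communality={communality[i]:.2f} "
          f"unique={unique_var[i]:.2f} sample variance={sample_var[i]:.3f}")
```

Each sample variance should come out close to 1, the sum of the two parts. In practice the problem runs the other way (the loadings are estimated from observed data), which this sketch does not attempt.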


2.4. Importance Sampling 

Importance sampling is a statistical method to reduce sample size while keeping estimates obtained from the
sample at a high level of confidence [Kahn53]. The method has recently been used to reduce the number of runs in
Monte Carlo simulations for evaluating computer dependability [Goyal92] [Choi92]. In the following, we first give
an overview of the method and then discuss its applications in the Monte Carlo simulation of discrete-time Markov
chains (DTMCs).
2.4.1. Overview of the Method


Assume that a random variable $X$ has p.d.f. $f(x)$ and that $Y = h(X)$ is a function of $X$. Our goal is to estimate
the expected value of $Y$,

$$\theta = E[Y] = E[h(X)] = \int_{-\infty}^{+\infty} h(x) f(x)\,dx , \tag{2.23}$$

through sampling. That is, we generate a sample $\{x_1, x_2, \ldots, x_n\}$ according to $f(x)$, therefore generating
$\{y_1, y_2, \ldots, y_n\}$, and then calculate

$$\hat{\theta} = \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{n} \sum_{i=1}^{n} h(x_i) .$$

It may be very expensive to generate a statistically significant sample of $X$. For example, if $y_i = h(x_i) = 0$ for
most generated $x_i$, we may need an extremely large sample to estimate $\theta$ with a high level of confidence.
However, if we can make the rare $x_i$'s which are "important" for estimating $\theta$ be selected much more frequently in
sampling while keeping the estimate unbiased, the sample size will be greatly reduced. This is the basic idea of the
importance sampling method.

To do importance sampling, we change the p.d.f. of $X$ from $f(x)$ to $g(x)$ such that those $x_i$'s which are of
importance in our parameter estimation have higher occurrence probabilities in $g(x)$. We use $X'$ to represent the variable
which has p.d.f. $g(x')$. By Eq. (2.23), we have

$$\theta = \int_{-\infty}^{+\infty} h(x) f(x)\,dx = \int_{-\infty}^{+\infty} h(x) \frac{f(x)}{g(x)} g(x)\,dx = \int_{-\infty}^{+\infty} h(x) \Lambda(x) g(x)\,dx , \tag{2.24}$$

where

$$\Lambda(x) = \frac{f(x)}{g(x)} \tag{2.25}$$

is called the likelihood ratio. Let $Y' = h(X')\Lambda(X')$; then Eq. (2.24) becomes

$$\theta = \int_{-\infty}^{+\infty} y'\, g(x)\,dx = E[Y'] . \tag{2.26}$$

Thus, instead of sampling from $f(x)$ to estimate the expected value of $Y$, the experiment is changed to sampling
from $g(x')$ to estimate the expected value of $Y'$. That is, we generate a sample $\{x'_1, x'_2, \ldots, x'_n\}$ according to $g(x')$,
therefore generating $\{y'_1, y'_2, \ldots, y'_n\}$, and then calculate
$$\hat{\theta} = \bar{Y'} = \frac{1}{n} \sum_{i=1}^{n} y'_i = \frac{1}{n} \sum_{i=1}^{n} h(x'_i) \Lambda(x'_i) .$$

The variance of the above estimator is

$$\mathrm{Var}[\hat{\theta}] = \frac{1}{n} \left( \int_{-\infty}^{+\infty} \frac{h^2(x) f^2(x)}{g(x)}\,dx - \theta^2 \right) .$$

To achieve the minimum variance, we should have

$$g(x) = \frac{h(x) f(x)}{\theta} .$$

But $\theta$ is the unknown parameter to estimate. A heuristic is that the shape of $g(x)$ should follow the shape of $h(x)f(x)$
as closely as possible.
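A small worked example may make the idea concrete. The sketch below estimates the rare-event probability $P(X > 4)$ for a standard normal $X$ by sampling from a normal density shifted to the rare region and weighting each sample by the likelihood ratio $f(x)/g(x)$. The choice of target event, the shifted density, and all function names are assumptions made for this illustration; they are not taken from the cited studies.

```python
import math
import random

def h(x):
    """Indicator of the rare event X > 4."""
    return 1.0 if x > 4.0 else 0.0

def likelihood_ratio(x, shift=4.0):
    """f(x)/g(x) for f = N(0, 1) and g = N(shift, 1)."""
    return math.exp(-shift * x + 0.5 * shift * shift)

def importance_estimate(n, shift=4.0):
    """Average of h(x') * likelihood_ratio(x') over n samples drawn from g."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(shift, 1.0)   # sample from g instead of f
        total += h(x) * likelihood_ratio(x, shift)
    return total / n

random.seed(7)
theta_hat = importance_estimate(100_000)
theta_true = 0.5 * math.erfc(4.0 / math.sqrt(2.0))   # exact tail probability
print(f"estimate = {theta_hat:.3e}, exact = {theta_true:.3e}")
```

Sampling directly from $f(x)$, only about three samples in 100,000 would fall in the region $x > 4$, so the plain estimate would be extremely noisy; under $g(x)$ roughly half of the samples land there, and the likelihood ratio keeps the estimate unbiased.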

2.4.2. Applications in DTMC Simulation 

In many cases, the operation of a computer system can be modeled by a DTMC (Discrete Time Markov Chain)
[Trivedi82]. If the resulting DTMC is very large (such that it exceeds the available storage) or functional simulation
(simulation of the execution of machine instructions, algorithms, etc.) is used above a DTMC, the Monte Carlo
simulation method is perhaps the only feasible way to solve the model. In dependability models, system failures are usually
rare events with extremely small probabilities. In order to obtain statistically significant results, a large number of
simulation runs is required, which can be very time consuming. In such a case, importance sampling can be used to reduce
the number of simulation runs, usually by orders of magnitude.

Assume we have a DTMC $\{Y_s, s \ge 0\}$ with a set of states $\{S_1, S_2, \ldots, S_m\}$ and a transition matrix $[p_{ij}]$. For
each simulation run, we have a path $x_i = S_{i_0}, S_{i_1}, \ldots, S_{i_k}$. The occurrence probability of path $x_i$ is [Goyal92]

$$P(x_i) = p_{i_0}\, p_{i_0 i_1} \cdots p_{i_{k-1} i_k} ,$$

where $p_{i_0}$ is the probability of starting in state $S_{i_0}$ and each $p_{ij}$ is an element in $[p_{ij}]$. All possible paths
constitute the probability space of a random variable: $X = \{x_1, x_2, \ldots\}$.

To reduce simulation runs, we change the transition probability matrix from $[p_{ij}]$ to $[p'_{ij}]$ such that those paths
which are of importance in our dependability evaluation are more likely to be sampled. After the change, the
occurrence probability of path $x_i$ is

$$P'(x_i) = p_{i_0}\, p'_{i_0 i_1} \cdots p'_{i_{k-1} i_k} .$$

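Assume the dependability measure to evaluate is $\theta = E[h(X)]$. Then $\theta$ can be estimated using a sample,
$\{x_1, x_2, \ldots, x_n\}$, obtained from simulations under the modified transition matrix by

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} h(x_i) \Lambda(x_i) , \tag{2.27}$$

where

$$\Lambda(x_i) = \frac{P(x_i)}{P'(x_i)} = \frac{p_{i_0}\, p_{i_0 i_1} \cdots p_{i_{k-1} i_k}}{p_{i_0}\, p'_{i_0 i_1} \cdots p'_{i_{k-1} i_k}} . \tag{2.28}$$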

The remaining question is how to determine $[p'_{ij}]$. Several heuristics called failure biasing have been proposed
in the literature [Lewis84] [Goyal92]. Here we introduce one of the commonly used heuristics. Assume that in state
$S_i$, transitions out of the state go to either a set of failure states, $F$ (e.g., the system suffers one more component
failure), or a set of recovery states, $R$ (e.g., the system recovers from a component failure). ($S_i$ itself can be treated as
either in $F$ or in $R$.) It is obvious that we have

$$\sum_{j \in F} p_{ij} + \sum_{j \in R} p_{ij} = 1 .$$

Define a parameter $b$ such that the $p'_{ij}$'s satisfy

$$\sum_{j \in F} p'_{ij} = b , \qquad \sum_{j \in R} p'_{ij} = 1 - b . \tag{2.29}$$

Then we determine each $p'_{ij}$ in state $S_i$ by

$$p'_{ij} = \begin{cases} \dfrac{b\, p_{ij}}{\sum_{k \in F} p_{ik}} , & j \in F \\[2ex] \dfrac{(1 - b)\, p_{ij}}{\sum_{k \in R} p_{ik}} , & j \in R \end{cases} \tag{2.30}$$

The parameter $b$ is usually chosen to be 0.5 [Goyal92]. Since the sum of the original probabilities to failure
states is often very small, by Eq. (2.29), the selection of $b$ can significantly increase these probabilities, thus making
these transitions much more likely to occur in simulations.
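The failure-biasing heuristic can be sketched as a transformation applied to one row of the transition matrix at a time, together with the likelihood ratio accumulated along a sampled path. The three-state model, its probabilities, and the names `bias_row` and `path_likelihood_ratio` below are hypothetical and serve only to illustrate the scaling rule; the initial-state probability is assumed to be the same under both matrices, so it cancels in the ratio.

```python
def bias_row(row, failure_states, b=0.5):
    """Rescale one state's outgoing probabilities so failures get total mass b."""
    fail_mass = sum(p for j, p in row.items() if j in failure_states)
    rec_mass = sum(p for j, p in row.items() if j not in failure_states)
    if fail_mass == 0.0 or rec_mass == 0.0:
        return dict(row)                      # nothing to rebalance in this state
    return {j: (b * p / fail_mass if j in failure_states
                else (1.0 - b) * p / rec_mass)
            for j, p in row.items()}

def path_likelihood_ratio(path, P, P_biased):
    """Product of p_ij / p'_ij over the transitions of a sampled path."""
    ratio = 1.0
    for i, j in zip(path, path[1:]):
        ratio *= P[i][j] / P_biased[i][j]
    return ratio

# Hypothetical model: failure transitions are rare in the original chain.
P = {
    "ok":       {"ok": 0.999, "degraded": 0.001},
    "degraded": {"ok": 0.9, "degraded": 0.099, "failed": 0.001},
}
FAILURE = {"ok": {"degraded"}, "degraded": {"failed"}}   # set F for each state
P_biased = {s: bias_row(row, FAILURE[s]) for s, row in P.items()}
ratio = path_likelihood_ratio(["ok", "degraded", "failed"], P, P_biased)
print(P_biased["degraded"], ratio)
```

In the biased chain the path to failure is taken with probability 0.25 instead of one in a million, and each such path contributes to the estimate with a weight of 4e-6, which is what keeps the estimator unbiased.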
III. DESIGN PHASE


In the early design phase of highly reliable systems, simulation is an important experimental means for performance
and dependability analysis. Compared to analytical modeling, simulation has the capability to model complex
systems in detail without being restricted to assumptions made in analytical modeling to keep the model mathematically
tractable. Thus, simulation is able to provide more accurate dependability evaluation than analytical models.
Simulations for dependability analysis can be performed by injecting faults in the system under study at the electrical
level, the logic level, and the function level. Dependability issues studied usually include but are not limited to: 1)
fault propagation, 2) fault latency, and 3) fault impact such as coverage, reliability, availability, and performance loss.
Figure 3.1 shows fault injections at the different levels.

Electrical-level fault injection simulation is usually used to emulate transient faults by changing the electric
current and voltage inside the circuits. The faulty current and voltage may cause errors in logic values at the gate level.
The gate-level errors may then propagate to other functional units and output pins of the chip. It has been reported
that transient faults account for more than 80% of the failures in computer systems [Siewiorek78], [Iyer86]. These
faults result from physical causes such as power transients, capacitive or inductive crosstalk, or cosmic particle
interventions [Yang92]. Electrical-level simulation can be used to study the impact of transient faults from the physical
level, but since the simulation has to track the propagation of faults from circuits to gates, to functional units, and
eventually out to the pins, it can be very time consuming and memory bound.

Figure 3.1. Simulated Fault Injections at Different Levels

For this reason, logic-level fault injection simulation applies abstractions of physical fault models to logic gates
to study large VLSI circuits and even computer systems. Commonly used fault models include stuck-at-0, stuck-at-1, and
inverted fault. These models are considered to be representative of faults at the gate level. Although simulation at the
logic level ignores the physical processes underlying gate faults, it still needs to trace the impact of gate-level faults to
higher levels. For the same reason that electrical-level simulation cannot be effectively used to study large VLSI
systems, logic-level simulation cannot effectively study large computer systems.

Function-level fault injection simulation is usually used to study dependability features of large computer or
network systems. Faults are injected into various components of the system under study. Functional fault models are
used in the simulation, while detailed processes of fault occurrence at lower levels are ignored. Functional models
represent the manifestation of faults at the lower levels and are extracted from results obtained from electrical-level or
logic-level fault injections or from field measurements. For example, "flipped memory bit" and "CPU register error"
are two typical fault models. Analytical dependability models of computer systems are usually built at this level.
Compared to analytical models, simulation is capable of representing detailed architectural features, real fault
conditions, and inter-component dependencies, thereby providing more accurate and believable results.

There are several common issues for fault injections at all levels. The first issue is that given fault models (e.g.,
one bit flip in memory) and types (e.g., transient fault), where do we inject faults? A simple way is to randomly
choose a location from the injection space (e.g., all gates in a VLSI chip or all memory bits). This scheme is easy to
implement, but many faults may have similar impact (e.g., all faulty bits in an ALU may have the same effect) and
many faulty locations may not be exercised. Another way is to inject faults only to representative locations which
have different impact, or only to representative workload areas. This approach can be used to study fault impact in
terms of locations or workloads.

The second issue involves workloads. The impact of faults on system dependability is workload-dependent.
Hence it is important to analyze a system while it is executing representative workloads. These workloads can be real
applications, selected benchmarks, or synthetic programs. If the goal of the study is to investigate fault impact on a
mission task, the real applications running in the mission may be used in the simulation. If the research goal is to
study fault impact on general workloads, several representative benchmarks may be selected for the simulation. If we
want to exercise every functional unit and location in the simulation, neither real applications nor benchmarks may
be appropriate. In this case, synthetic workloads can be designed for achieving the goal. The workload issue further
complicates simulation models and increases simulation time. It is necessary to develop ways to represent realistic
workloads while still maintaining reasonable simulation times.

The third issue is simulation time explosion, which occurs when: 1) too much detail is simulated, such as
modeling physical processes in fault injections at the electrical level, and 2) extremely small failure probabilities require
a large number of simulation runs to obtain statistically significant results (the theory is discussed in Section 2.1). Several
techniques, including mixed-mode simulation [Saleh90] [Choi92], importance sampling [Goyal92] [Choi92], hybrid
simulation [Bavuso87], [Goswami93a], and hierarchical simulation [Goswami92], have been used to tackle the time
explosion problem.

Table 3.1 summarizes features and representative studies in simulated fault injections at different levels. We will 
discuss these studies in the following three sections. 


Table 3.1. Summary of Simulated Fault Injections

Category        Electrical Level            Logic Level                  Function Level
--------------  --------------------------  ---------------------------  ----------------------------
Approach        Alter electrical current    Inject stuck-at or inverted  Inject faults to CPU,
                and voltage in circuits     faults to logic gates        memory, I/O devices, etc.

Target Under    VLSI chip                   VLSI chip                    Computer system
Study           Software running            Computer system              Network system
                on the chip                 Software                     Software

Studies         Fault simulation [Yang92]   BDX-930 [McGough81]          Trace-driven [Chillarege87]
                HA1602 [Duba88]             BDX-930 [Lomelino86]         NEST [Dupuy90]
                FOCUS [Choi92]              IBM RT PC [Czeck91]          DEPEND [Goswami92]
                                                                         REACT [Clark93]


3.1. Simulated Fault Injection at the Electrical Level


There are several reasons for performing fault injections at the electrical level. First, fault injection at this
level can be used to study the impact of physical causes which lead to faults and errors. Secondly, it has been pointed
out by previous studies [Banerjee82], [Beh82] that simple stuck-at fault models do not represent some types of faults.
Thirdly, some circuits are of a mixed analog/digital nature which cannot be fully characterized by logic-level fault
models. Thus, there is a growing need for fault simulators which can handle electrical transient faults and permanent
physical failures for the purposes of both circuit testing and dependability evaluation.

The basic simulation methodology used in fault injections at the electrical level is the mixed-mode method in
which the fault-free portions of the circuit are simulated at the logic level while the faulty portions of the circuit are
simulated at the electrical level [Saleh90]. Figure 3.2 illustrates the method. A simple CMOS AND gate with buffered
output is drawn in the figure. The dotted boxes indicate normal voltage waveforms for the circuit and the dashed
boxes contain waveforms resulting from a transient injection at the location marked by X. Notice that waveforms
within the electrical-level analysis behave in an analog fashion, but are discrete in the logic-level analysis.

Figure 3.2. Illustration of Fault Injection at Electrical Level 
A representative mixed-mode fault simulator is SPLICE1 [Saleh84]. The electrical analysis in SPLICE1 is
based on the method of iterated timing analysis (ITA), which incorporates a nonlinear relaxation method with
event-driven selective tracing. ITA has been shown to be accurate and fast (it can provide a speed-up of up to two orders of
magnitude). The logic analysis in SPLICE1 is performed using a relaxation-based method including MOS-oriented
models. Recently, more advanced techniques, such as the concurrent mixed-mode simulation of permanent faults and
the dynamic mixed-mode simulation of transient faults, have been developed [Yang92].

We now discuss two studies in electrical-level fault injection. Both use SPLICE1 as the fault simulation
engine. The first is a case study of the impact of different levels of current transients on a microprocessor-based chip.
The second is a fault injection tool which integrates a fault injection engine, a tracing facility, and graphical and statistical
analysis packages into a user environment.

3.1.1. Simulation of a Microprocessor-Based Chip 

One of the studies in this field was an experimental analysis of the susceptibility of a microprocessor-based jet
engine controller to upsets caused by current and voltage transients through simulated fault injections [Duba88]. The
target system for the study was an HA1602, a microprocessor-based digital jet-engine controller designed by Hamilton
Standard for commercial aircraft and made available to NASA Langley AIRLAB. SPLICE1 was chosen for the fault
simulation in the study. A number of enhancements to SPLICE1 were made to facilitate the fault injection simulations.

The parameters used in the simulations were extracted from those used in the HA1602 design and circuit layout.
The application code running on the simulated processor was chosen such that all the functional units at which
transient fault injections were made were exercised. Fault injections were made at seven randomly chosen nodes in each of six
functional units. For each node, current transients were injected at five different charge levels: 0.5, 1.0, 2.0, 3.0, and
4.0 picocoulombs (pC). Each charge level was injected at five different time points during the execution of the application
code sequence. This amounted to over 1000 fault injections/simulations (7 x 6 x 5 x 5 = 1050).

The error data was generated by comparing each faulted simulation with a fault-free simulation. An error event
was defined as either a logic state change or a voltage level change large enough to cause a node to be faulted at a
future time. Error events were classified into three categories: 1) logic upsets: voltage transients large enough to
constitute logic level errors, 2) latched errors: errors in the first-level latches, and 3) pin errors: errors at the chip I/O
pins. The overall results from the experiments are shown in Table 3.2. It can be seen that the injected transients have a
41.6% chance of causing a logic upset (no errors), a 5.7% chance of resulting in a latched error (a latent error in the
circuit), and a 5.6% chance of an error propagating to pins. The other 47% of the injected transients have no impact on the
processor. Thus, only 11% of all injected transients cause either a permanent change in circuit behavior or affect the
external environment. The table shows that transients below 3.0 pC have no significant impact on the circuit.

Table 3.2. Severity of Injected Transient Faults

Error Category       Occurrences  Percentage  Charge Threshold
-------------------  -----------  ----------  ----------------
Injected Transients         1050       100.0  3.0 pC
Logic Upsets                 437        41.6  3.0 pC
Latched Errors                60         5.7  3.0 pC
Pin Errors                    59         5.6  3.0 pC

The study also investigated the impact of current and voltage transients occurring in the different functional 
units of the processor. An ALU transient was found to most likely result in logic upsets and pin errors. Further, the 
analysis of variance (ANOVA) technique was used to quantify the sensitivity of pin-level errors to error activity in the 
different functional units. The results of ANOVA are shown in Table 3.2, which indicate that the output pin errors are 
most sensitive to error activity in ALU. 

3.1.2. FOCUS — A Chip-Level Simulation Environment 

FOCUS is a simulation environment, developed at the University of Illinois, for fault sensitivity analysis of IC
chips [Choi92]. In the environment, a range of user-specified faults are automatically injected at the circuit level, and
fault propagation is measured at the gate and higher levels. Figure 3.3 depicts the overall experimental environment.
The environment takes as input a net-list of the hardware description of the system and converts it into a simulation
model. SPLICE1 is used as the fault simulation engine. The importance sampling technique, which was introduced
in Section 2.4, is used in FOCUS to accelerate simulations.
Figure 3.3. FOCUS Experimental Environment



The fault injection process is implemented as a run-time modification of the circuit, whereby a current source is
added to a target node,¹ thus altering the voltage level of the node over the time interval of the injected current
waveform. The experimental environment allows both transient and permanent (single or multiple) fault injections. Since
the injected current source is specified as a mathematical function, the resulting transients can be of varying shapes
and duration. For example, electrical power surge, in-chip alpha particle intervention, lightning, and bridging faults
can be modeled. The user can control the location of a fault, the time and duration of a fault, and the shape of the
current source.

The tracing facility monitors all switching activities in the target system, including fault propagation through
each gate or transistor, for all processed events. The trace data for each event consists of the time of the event, the
hierarchical node name, and the new and previous voltage levels (for electrical nodes) or logic levels (for logic nodes).

¹ A node is defined as a point in a conductive interconnection between electrical and/or logical elements.
The graphical analysis facility is used to visualize the fault activity in different functional units
and the fault propagation on the major interconnects and at the external pins. The statistical analysis facility is used to perform
impact analysis of the target system and generate necessary models to depict the fault behavior in the system, such as
pin error distribution, latch error distribution, and internal fault propagation models.

The application of FOCUS is illustrated by studying a target system. The target system is a microprocessor
used in commercial aircraft for real-time control of jet-engine [...]. The [...] microprocessor [...]
latch, pin, and functional levels.

Nearly 50 instruction cycles (90300 nanoseconds) of the application code were executed on the target system
during each simulation run. The application code was carefully selected to ensure that all of the functional units were
exercised. A total of 2100 simulations were performed for obtaining stable results. During the simulation, all nodes
(including all latches and external I/O pins) in the circuit were monitored and processed. Table 3.3 summarizes the
overall impact of transients in the range 0.5 to 9.0 picocoulombs. In the table, a first-order error is defined as one
which occurs during the first clock cycle following a transient fault injection; second and higher-order errors are those
occurring during the second and subsequent clock cycles.

Figure 3.4. The Target Microprocessor System

Table 3.3. Impact of Transients Injected to the Target System

Type                                  Injections Involved  Percent of Total Injections  Resultant Errors
------------------------------------  -------------------  ---------------------------  ----------------
First-Order Latch Errors                              470                         22.4              2149
Second and Higher-Order Latch Errors                  120                          5.7              1829
First-Order Pin Errors                                255                         12.1              1168
Second and Higher-Order Pin Errors                     90                          4.3               839
Functional Errors                                     193                          9.2               747

Figure 3.5 shows the propagation of the latch errors in time. In the figure, the x-axis represents the clock cycles
from the fault injection time, and the y-axis represents the total latch error count for each clock cycle. It can be seen
that, given a certain number of latch errors in the first clock cycle, the number of latch errors decreases significantly
until the fourth clock cycle. At approximately the fifth clock cycle, the number of errors rapidly multiplies. This is
because at this time, the error signal enters a unit with a large number of latches and high fan-out, e.g., the ALU
registers. After the sixth cycle, the number of errors decreases significantly until finally disappearing after the eighth
cycle. Thus, the impact of latch errors lasts at most up to 8 clock cycles from the time of fault injection.


Figure 3.5. Latch Error Occurrence in Time (x-axis: clock cycles from fault injection; y-axis: frequency)
3.2. Simulated Fault Injection at the Logic Level


Simulated fault injection at the logic level is similar to that at the electrical level in that they are both
circuit-level simulations. The difference is in the fault models used. In the electrical-level injection, physical fault models are
used, while in the logic-level injection, logic fault models are used. Logic-level fault simulation uses abstract logical
models for both faults and circuit functions to evaluate the behavior of a system. In contrast to the evaluation of the
physical models used in the electrical-level simulation, logic-level simulations perform binary operations that
represent the behavior of a given device. They take binary input vectors and evaluate the output of the device for a given
input pattern. Each signal in the circuit is represented by a member of a set of boolean values depicting the steady
state conditions of the physical circuit. For example, the set {1, 0, X} is often used to describe high, low, and unknown
voltage values for logic gates. Fault injection at this level simply forces a node to either stuck-at-1 or stuck-at-0, or it
inverts a logic value. Fault injection location and time can be set arbitrarily. Hence, with logic simulation, one
obtains outputs with discrete values and possibly with some approximate timing information. Typically, outputs of the
faulty and non-faulty systems are compared to determine whether a fault has been detected.
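The comparison of faulty and fault-free outputs can be sketched on a toy gate-level netlist. The example below is an assumption-laden illustration rather than a description of any simulator cited here: the circuit is a hypothetical one-bit full adder, signals take only the values 0 and 1 (no unknown value X), and a fault is a single node forced to a stuck-at value.

```python
from itertools import product

# Hypothetical netlist: (output node, gate type, input nodes), in topological order.
NETLIST = [
    ("n1", "AND", ("a", "b")),
    ("n2", "XOR", ("a", "b")),
    ("sum", "XOR", ("n2", "cin")),
    ("n3", "AND", ("n2", "cin")),
    ("cout", "OR", ("n1", "n3")),
]
PRIMARY_INPUTS = ("a", "b", "cin")
PRIMARY_OUTPUTS = ("sum", "cout")
GATES = {"AND": lambda x, y: x & y, "OR": lambda x, y: x | y, "XOR": lambda x, y: x ^ y}

def simulate(vector, fault=None):
    """Evaluate the circuit; fault = (node, value) forces a stuck-at value."""
    values = dict(zip(PRIMARY_INPUTS, vector))

    def force(node):
        if fault is not None and fault[0] == node:
            values[node] = fault[1]

    for node in PRIMARY_INPUTS:
        force(node)
    for out, gate, ins in NETLIST:
        values[out] = GATES[gate](*(values[i] for i in ins))
        force(out)
    return tuple(values[o] for o in PRIMARY_OUTPUTS)

def coverage(test_vectors):
    """Fraction of single stuck-at faults whose outputs differ from the fault-free run."""
    nodes = list(PRIMARY_INPUTS) + [out for out, _, _ in NETLIST]
    faults = [(n, v) for n in nodes for v in (0, 1)]
    detected = [f for f in faults
                if any(simulate(t, f) != simulate(t) for t in test_vectors)]
    return len(detected) / len(faults)

all_vectors = list(product((0, 1), repeat=3))
print(coverage([(0, 0, 0)]), coverage(all_vectors))
```

A single all-zero test vector detects only the stuck-at-1 faults (half of the fault list), while the exhaustive set of eight vectors detects every single stuck-at fault in this small circuit.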

For MOS technology, a gate-level logic simulator is inadequate to handle circuits containing pass transistors,
ratioed logic, buses, and other features that exhibit bidirectional signal flow and/or charge-sharing effects. To handle
such transistor networks without resorting to expensive electrical-level analysis, switch-level simulation was proposed in
[Bryant84]. Switch-level analysis allows bidirectional signal flow and different levels of signal strengths. For
example, a discrete set {0, 1, ..., 9} can be used to model different signal strengths or voltage levels. At this level,
transistor-level fault modeling can also be incorporated easily.

Fault simulation has been widely used for evaluating the coverage of a given set of test vectors for testing
manufacturing defects in a chip. Typically, a set of test vectors generated by an automatic test pattern generator
(ATPG), randomly, or manually is submitted to the fault simulator in order to decide how many faults can be detected
by the test vectors. In this case, the generated test vectors become workloads or inputs to the system. At the
beginning of such a simulation, a stuck-at fault is injected, and the faulty circuit is simulated while a given test is applied at
the primary inputs of the circuit. A similar run is performed again without any fault injection. The logic values at the
primary outputs of the faulty circuit are then compared to the outputs of the fault-free run to determine if there is any
difference.
To evaluate system dependability based on realistic fault models, a fault-dictionary approach can be taken
[Choi93]. A fault-behavior dictionary generated from electrical-level fault analysis can be used as a fast look-up table
for a logic-level concurrent or parallel fault-injection simulation. First, an electrical-level fault-behavior dictionary for
a given chip can be generated by extensive fault simulations. In this step, gates around the fault-injection location are
extracted, and a subcircuit consisting of these gates is formed. This subcircuit is exercised by exhaustively applying
all input combinations while fault injection is performed. Faulty behavior at each of the subcircuit outputs is analyzed
and recorded in a dictionary. The resulting entry in the dictionary consists of the input vector, fault-injection time, and
fault location. Second, concurrent run-time fault injections of the generated logical error at the subcircuit level using
the fault dictionary can be performed. The concurrent simulator is used to propagate, in a single simulation pass, the
effects of the injected errors.

Both transient and permanent faults can be injected using switch-level or gate-level logic simulation. The over- 
all simulation approach for fault injections at the logic level consists of the following steps: 

(1) Obtain the net-list of a design and devise appropriate simulation models to emulate the given design. 

(2) Simulate the model using a logic-level simulator. 

(3) During the simulation, run a given workload depicting the application or test software (by applying test vec- 
tors to the primary inputs). 

(4) Save the behavior of the system under fault-free conditions by tracing all the changes in the evaluated logic 
events of monitored nodes for comparison with the subsequent fault-injection runs. 

(5) Run the same workload again and inject a fault to a selected node during the simulation period and trace. 

For a stuck-at fault: Force the state of the selected node to either 1 (for stuck-at-1 fault) or 0 (for stuck-at-0 
fault) and evaluate the circuit. Hold the state to stuck-at fault value throughout the simulation. 

For a transient fault: Force the state of the selected node to a logic value that is the reverse of the normal state 
(i.e., force a 0 if the normal state is a 1, and vice versa). Hold the state to the reverse value on the node for one 
or more clock cycles. Let the fault effect propagate by evaluating the circuit with the corrupted logic state. 
Release the forced state when a new signal/event arrives at that node. 
(6) Monitor the behavior of the system under fault conditions.

(7) Compare the traces from the faulty and fault-free runs and identify the differences to determine where and 
when the fault has propagated. 

(8) Use collected statistical measurements to determine dependability parameters (e.g., detection coverage) and 
the fault impact (e.g., minor program error or system failure). 
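Steps (4) through (7) can be sketched as a pair of simulation runs whose traces are compared node by node. The model below is deliberately simplified and entirely hypothetical: a three-node clocked pipeline described behaviorally stands in for a net-list-driven logic simulator, the workload is a fixed bit sequence, and the transient fault inverts one node for a chosen number of clock cycles.

```python
def run(workload, fault=None):
    """Simulate a hypothetical 3-node pipeline; return a trace of (cycle, node, value).

    fault = (node, start_cycle, duration) inverts that node's value for
    `duration` cycles to model a transient; None gives the fault-free run.
    """
    state = {"s1": 0, "s2": 0, "out": 0}
    trace = []
    for cycle, bit in enumerate(workload):
        nxt = {"s1": bit, "s2": state["s1"] ^ state["s2"], "out": state["s2"]}
        if fault is not None:
            node, start, duration = fault
            if start <= cycle < start + duration:
                nxt[node] ^= 1            # force the reverse of the normal state
        state = nxt
        trace.extend((cycle, n, v) for n, v in sorted(state.items()))
    return trace

def compare(golden, faulty):
    """Return the (cycle, node) pairs at which the faulty trace differs."""
    return [(c, n) for (c, n, g), (_, _, f) in zip(golden, faulty) if g != f]

workload = [1, 0, 1, 1, 0, 0, 1, 0]
golden = run(workload)                                # step (4): fault-free reference
faulty = run(workload, fault=("s1", 2, 1))            # steps (5)-(6): inject and trace
diffs = compare(golden, faulty)                       # step (7): locate propagation
first_output_error = min(c for c, n in diffs if n == "out")
print(diffs[0], first_output_error)
```

Here the first divergence appears at the injected node in cycle 2 and reaches the output two cycles later; repeating such runs over many injection sites and times yields the statistics used in step (8).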

The above fault injection steps should be repeated a large number of times for a given workload. If the experiment 
is intended to estimate single measures (e.g., detection coverage), the number of injections required for achieving 
a given confidence interval can be determined using Eq. (2.14). If the purpose of the experiment is to obtain 
distributions (e.g., error latency distribution), the fault injections should not be stopped until the constructed 
distribution is stable, i.e., two consecutively constructed distributions are not statistically different. Importance 
sampling can be used to significantly reduce the number of simulation runs. 
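
As a rough illustration of these two stopping rules, the sketch below makes two assumptions that are not taken from the text: it uses the standard normal approximation for a proportion in place of Eq. (2.14), whose exact form may differ, and it uses the two-sample Kolmogorov-Smirnov distance as one possible way to compare consecutive empirical distributions.

```python
import math


def injections_needed(p_guess, half_width, z=1.96):
    """Runs needed so that the confidence interval on a proportion (e.g.,
    coverage) has the requested half-width, by the normal approximation.
    z = 1.96 corresponds to a 95% confidence level."""
    return math.ceil(z * z * p_guess * (1.0 - p_guess) / half_width ** 2)


def ks_distance(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    points = sorted(set(sample_a) | set(sample_b))

    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)
```

For example, `injections_needed(0.9, 0.02)` asks how many injections are needed to estimate a coverage near 0.9 to within 0.02. A small `ks_distance` between the latency samples before and after an additional batch of injections suggests that the distribution has stabilized; the threshold to use is left to the experimenter.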

Two early studies of fault injection at this level took a digital avionic miniprocessor, the BDX-930, as the target 
system. The first study investigated the impact of faults at gates and pins on the output results of programs, with 
emphasis on fault latency and fault coverage [McGough81]. The second study investigated error propagation from 
the gate level to the pin level [Lomelino86]. A recent study explored the behavior of transient faults which 
occur during the normal execution of a processor [Czeck91]. The study quantified faults that can be emulated by 
software-implemented fault injections (to be discussed in Section 4.2). We discuss these studies in the following 
two subsections. 

3.2.1. Study of Bendix BDX-930 

An early study in this field was the simulated fault injection to the Bendix BDX-930, a digital avionic 
miniprocessor, to investigate fault latency and coverage [McGough81]. The BDX-930 was composed of bit-slice processors 
(AMD2901) and was used in a number of flight control avionic systems. Fault tolerance was achieved by replication 
of the processing and voting in software. A gate-level emulator of the BDX-930 was developed for this study. The run 
speed of the emulator was 25,000 times slower than the BDX-930 when hosted on a PDP-10. 

The methodology used in the study was: Given a software program running on the processor, inject a single 
stuck-at or inverted fault at a gate or pin selected randomly and observe the time to detection, assuming that a detec- 
tion occurs whenever there is a difference between the outputs of the faulty and fault-free processors executing the 
same program. The experiment was repeated 600 to 1,000 times to construct an empirical latency distribution. Six 
different programs were selected to repeat the above experimental procedure. In addition, a typical avionic flight con- 
trol system self-test program was written for this study and executed to determine fault detection coverage. 

Results showed that most detected faults were detected in the first repetition of the program. Subsequent 
repetitions did not significantly increase the proportion of detected faults. A large percentage of faults (about 
60% for the gate-level faults and 30% for the pin-level faults) remained undetected after as many as eight 
repetitions of the program. The fault coverage of the self-test program was found to be 87% for the gate-level 
faults and 98% for the pin-level faults. 

The above study emphasized the impact of faults at gates and pins on the output results of programs. Another 
study on the Bendix BDX-930 computer investigated error propagation from the gate level to the pin level 
[Lomelino86]. In this study, a single AMD2901 processor chip in the BDX-930 was selected for fault injection and 
error data collection. The processor was simulated using an event-driven, gate-level logic simulator developed at 
NASA Langley [Migneault85]. The simulator was driven by a self-test program, developed for the BDX-930, which 
provides a high probability of detecting error activity. 

In the simulation, the single stuck-at fault model was applied to 150 selected gates for fault injection. The gates 
were distributed throughout the nine function units of the AMD2901. Error data was collected by first simulating a 
fault-free circuit, then simulating the circuit with a single injected fault, and finally comparing the two 
simulation outputs to obtain the differences. Four sets of simulation experiments, consisting of 150 simulations 
per set, were conducted. Results showed that 78.7% of faults produced error propagation detected within the chip and 
66.7% of faults produced errors that propagated to the output pins, within the first 100 clock cycles. The error 
distribution at the output pins was found to be sensitive to the locations of faults. The results also showed that 
the error activity increases with the increase of concurrent microinstruction activity. 

3.2.2. Study of IBM PC RT 


In [Czeck91], a simulation model of the IBM RT PC was used to inject transient, gate-level faults to explore 
the behavior of transient faults which occur during the normal execution of a processor. The emulated hardware 
functional units in the processor included the instruction prefetch buffer (IPB), microinstruction fetch (MIF), data 
fetch and storage (DFS), ALU and shifter (ALU), and ROMP storage channel interface (RSCI). The simulation model 
included both the original error detection mechanisms (EDMs) which reside in the IBM RT PC (such as illegal 
instruction traps and memory access violation) and additional error detection mechanisms which were provided in this 
study for evaluating their effectiveness (such as timeout and control flow monitoring). 

Figure 3.6 shows possible error manifestations after a fault injection. In the figure, minor errors are differences 
in the internal processor state between the faulty and fault-free simulation runs, which have not been detected by an 
EDM. Monitoring errors are those which are uncovered by the "duplicate and compare" EDM which monitors bus 
addresses and data. Severe errors are those resulting in the change of a microinstruction and the instruction address 
registers, which lead to a change in the control flow of the program. Fatal errors have triggered a system resident 
EDM and caused an abnormal termination of the application task. Results overdue occurs when the task executes 
longer than a predetermined time limit and the execution is halted. Overwritten means that the injected fault does not 
manifest to a minor error or a minor error is overwritten by correct data. 


Figure 3.6. Error Manifestations in the IBM PC RT Simulation Model 

Three workloads were selected for this study: an iterative matrix multiplication, a recursive Fibonacci program, 
and an iterative Fibonacci program. These workloads were considered to be representative of the characteristics of 
instruction set architectures and diversity in program structure. For each workload and each fault location, 1000 
faults were injected. The experimental method was as follows: 

(1) For each workload, the fault-free behavior of the workload is extracted from the internal state of the processor 
and saved for comparison during the subsequent fault injection experiments. 

(2) A fault location is selected such that the fault in the location has a high probability of producing an error and 
locations for different injections do not yield the same error behavior. 

(3) The fault injection time is set to the start of the workload execution. The fault injection time will be advanced 
by one cycle for each successive fault injection experiment. 

(4) The fault is injected for a duration of one clock cycle at the location and time selected. 

(5) For each successive clock cycle, the internal processor state of the faulty processor is compared with that 
obtained in step 1. Differences are recorded for off-line analysis. 

(6) The faulty behavior is monitored at each clock cycle until the program execution is completed or a severe 
error causes the monitor to cease. 

(7) The simulation run is restarted at step 3 and the time of next fault injection will be advanced by one clock 
cycle. 
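
The injection-time sweep in steps (1) through (7) can be sketched as follows. This is a toy model under stated assumptions: a hypothetical accumulator machine stands in for the processor model, the one-cycle transient is modeled as a single bit flip in the accumulator, and the names `run` and `sweep` are invented for illustration.

```python
def run(workload, inject_at=None, bit=0):
    """Toy accumulator machine; returns the accumulator value after each cycle."""
    acc, states = 0, []
    for cycle, (op, val) in enumerate(workload):
        if cycle == inject_at:
            acc ^= 1 << bit                 # one-cycle transient as a bit flip
        acc = val if op == "load" else acc + val
        states.append(acc)
    return states


def sweep(workload, bit=0):
    """Inject one fault per run, advancing the injection time by one cycle."""
    golden = run(workload)                  # step 1: fault-free reference
    results = {}
    for t in range(len(workload)):          # steps 3 and 7: advance injection time
        faulty = run(workload, inject_at=t, bit=bit)       # step 4
        diffs = [c for c, (g, f) in enumerate(zip(golden, faulty)) if g != f]
        results[t] = diffs[0] if diffs else None           # None: overwritten
    return results


workload = [("load", 5), ("add", 2), ("load", 1), ("add", 3)]
outcome = sweep(workload)
```

In this sketch, faults injected just before a `load` are overwritten and never manifest, which mirrors the "overwritten" outcome described for Figure 3.6.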

Results of the study include: 40% to 55% of injected faults do not produce an error. Among the faults that 
manifest as errors, approximately 2/3 of them can be emulated by the software-implemented fault injection approach 
(to be discussed in Section 4). The other 1/3 of these faults manifest as errors in CPU components (e.g., 
microinstruction control registers) that are not accessible to software. At the system level, the fault behavior 
showed a strong dependency on the workload structure and instruction sequencing rather than the instruction mix. 
Error detection latency was found to follow a Weibull distribution with a decreasing detection rate. The distribution 
represents two error occurrence processes: a fast process in which fault manifestation and error propagation occur 
within a small time window and a slow process in which dormant faults and errors are activated gradually by the workload. 
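
To make "decreasing detection rate" concrete, the sketch below evaluates the hazard function of a Weibull distribution, which decreases with time when the shape parameter is below 1. The parameter values used in the example are arbitrary and are not taken from [Czeck91].

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous detection rate h(t) = (k/s) * (t/s)**(k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_cdf(t, shape, scale):
    """Probability that the error has been detected by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))
```

With `shape=0.5` and `scale=1.0`, the hazard at `t=1` is twice the hazard at `t=4`: most detections happen early (the fast process), and the remaining errors are detected at an ever slower rate (the slow process).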

3.3. Simulated Fault Injection at the Function Level 


Function-level fault injection simulation is used to study complete computer and network systems rather than 
the components of which they consist. These studies typically consider the hardware, the software, their interactions, 
and the inter-dependence between the various components of the system. There are at least five issues in developing 
functional simulation models at the system level. 

The first is a lack of well-established system level fault models. This is partly due to the second issue: a large 
and varied component domain. At the gate level, the basic components are gates with single functions and well- 
defined interconnections. At this level, it is possible to establish a fault model, such as the single stuck-at fault model 
which can be consistently applied to all gates to model their fault behavior. At the system level, the basic components 
include CPUs, communication channels, disks, software systems and memory. The components have complex inputs, 
perform multiple functions, have varied physical attributes (e.g. hardware and software) and complex interconnec- 
tions. In addition to the diversity of the components that make up a system, two similar components (such as two 
CPUs) can have different functions and behavior. This makes it difficult to establish a single fault model that can be 
applied consistently to all components. 

For this reason, various types of fault models are required and will depend upon the type of component being 
injected. The fault models are functional fault models that simulate system-level manifestations of gate-level faults. 
For instance, a single bit-flip is typically used to simulate a memory or register fault. Various fault models can be 
used for communication channels. Messages traversing the channel can be corrupted or destroyed, or the channel can 
be made inoperative. A fault in a processing node can be modeled as a service interrupt caused by CPU, memory, 
disk, or software faults in the node. More detailed fault models for a processor or other system components can be 
derived from lower-level simulations using the fault-dictionary approach discussed in Section 3.2. For instance, a 
gate-level simulation of a processor can be injected with faults while executing a typical workload. The effect of the 
faults on the workload can be stored in a fault dictionary that contains, for each gate-level fault injected, the types 
of effects and the probability of these effects. This dictionary can then serve as a fault model for system-level 
simulations. 
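
A fault dictionary of this kind can be sketched as a mapping from each gate-level fault to the relative frequency of its observed effects. The fault identifiers and effect names below are hypothetical, and the functions are an illustration rather than the format used by any particular tool.

```python
import random
from collections import Counter, defaultdict


def build_dictionary(observations):
    """observations: iterable of (gate_level_fault_id, observed_effect) pairs
    collected from lower-level fault injection runs."""
    counts = defaultdict(Counter)
    for fault_id, effect in observations:
        counts[fault_id][effect] += 1
    return {
        fault_id: {e: n / sum(c.values()) for e, n in c.items()}
        for fault_id, c in counts.items()
    }


def sample_effect(dictionary, fault_id, rng):
    """Draw a system-level effect for a fault according to its probabilities."""
    effects = dictionary[fault_id]
    names = list(effects)
    return rng.choices(names, weights=[effects[n] for n in names], k=1)[0]


# Hypothetical observations from a gate-level simulation.
runs = [("alu_g17_sa1", "wrong_result"), ("alu_g17_sa1", "no_effect"),
        ("alu_g17_sa1", "no_effect"), ("alu_g17_sa1", "no_effect")]
fault_dict = build_dictionary(runs)
```

A system-level simulation can then call `sample_effect(fault_dict, "alu_g17_sa1", random.Random(0))` each time it injects that fault, instead of re-running the gate-level model.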

the failure mechanisms, obtain failure probabilities, and quantify the effect of faults. It can be used to pick out the key 
features that must be modeled and help to determine and specify both the structure of, and the parameters to, analyti- 
cal models. 

A single distinguishing feature between probabilistic modeling and behavioral modeling is brought out by one 
of the results of this study (details of all the results can be found in [Goswami93b]). The study helped to uncover a 
design feature of the software that caused erratic increases in system response time only when status messages were 
destroyed. Once the software was modified, the erratic increase in response time ceased. Clearly, such results cannot 
be obtained with analytical modeling tools. 

An additional advantage of functional simulation tools is that they allow the use of any type of TTF distribu- 
tions. Unlike analytical modeling, in which only a few types of distributions are commonly used for the tractability of 
models, the simulation method can handle any form of distribution, empirical or analytical. 

An early study used a trace-driven simulation approach to analyze error latency [Chillarege87]. The approach is 
based on sampled data of physical memory activity gathered, through hardware instrumentation, from a computer sys- 
tem running normal workloads. The data are then used for a trace-driven simulation in which faults are inserted into 
the trace to emulate fault occurrence and error discovery processes in the system. The approach provides a means to 
study error latency in memory systems under real workloads. 
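
The idea of inserting faults into a memory-access trace can be sketched as below. The sketch assumes a simplified discovery rule that is not spelled out in the text: an error is discovered at the first read of the faulty location after the fault occurs, and it disappears if the location is written first. The trace format and the function name `error_latency` are hypothetical.

```python
def error_latency(trace, fault_time, address):
    """trace: time-ordered list of (time, op, address) with op 'R' or 'W'.

    Returns the time from fault occurrence to the first read of the faulty
    address, or None if the location is overwritten first or never read again.
    """
    for time, op, addr in trace:
        if time < fault_time or addr != address:
            continue
        if op == "W":
            return None             # fault overwritten before discovery
        return time - fault_time    # first read discovers the error
    return None


trace = [(0, "W", 10), (5, "R", 20), (8, "R", 10), (12, "W", 20), (15, "R", 20)]
```

Repeating the call over many randomly chosen fault times and addresses yields an empirical error latency distribution for the traced workload.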

In recent years, several function-level simulation tools that can be used for fault injections have been or are 
being developed. NEST, DEPEND, and REACT are three representative tools. REACT, a software testbed that per- 
forms automated life testing of a variety of multiprocessor architectures through simulated fault injections, is being 
developed at the University of Massachusetts [Clark93]. Several system, workload, and fault/error models, which are 
representative of multiprocessor architectures and conditions, are embedded in the testbed. The tool can be used to 
evaluate system reliability and availability metrics. Preliminary versions of the software have been reported to be 
successfully employed in several studies of multiprocessor systems [Clark93]. 

NEST is a function-level testbed that specializes in modeling and evaluating distributed network systems 
[Dupuy90]. Although the tool is not designed for fault injections, users can make node or link failures by deleting or 
adding nodes and links or changing their features while the simulation is running. DEPEND, developed at the 
University of Illinois, exploits the properties of the object-oriented paradigm to provide a general-purpose, 
system-level dependability analysis tool that can evaluate various types of fault tolerant architectures [Goswami92]. The 
object-oriented feature of DEPEND makes the tool capable of modeling multiple levels of functional units to meet a 
wide range of applications. The next two subsections discuss NEST and DEPEND, respectively. 

3.3.1. NEST — A Network Simulation Testbed 

The NEtwork Simulation Testbed (NEST) is a graphical environment, running on the UNIX system, for modeling, 
executing, and monitoring distributed network systems and protocols [Dupuy90]. Using a set of graphical tools 
provided by NEST, the user can develop simulation models of communication networks. The model includes node 
functions (e.g., routing protocols) and communication link behaviors (e.g., packet loss or delay features), typically 
coded in C. These user procedures are linked with run-time routines embedded in NEST and executed by the NEST 
simulation server. The user can reconfigure the modeled network system through graphical interaction or programming. 
Built-in graphical tools allow users to program custom monitors to observe the simulation results on-line. 

Figure 3.9 shows the overall architecture of NEST. NEST consists of a simulation server and several client 
monitors. The simulation server is responsible for running simulations. The generic client monitors are used to config- 
ure simulation models and control their executions. The custom client monitors are used to observe simulation behav- 
ior and display results. Clients can reside on separate machines so that the server is dedicated to time-consuming sim- 
ulations. 

Node functions are used to model distributed communicating processes running at network nodes (e.g., proto- 
cols and database transactions). NEST executes node processes and their communication calls using a set of embed- 
ded primitives for sending, broadcasting, and receiving packets. The motion of a packet over links is simulated by 
passing it through the link functions. Link functions are used to model the behavior of communication links (e.g., 
packet loss and link jamming). Link functions are also used to monitor and collect performance statistics of link traf- 
fic. The simulation server schedules the execution of the node and link processes to meet the delay and timing speci- 
fied by the user. 

Figure 3.9. Overall Architecture of NEST 

The user can create and modify a network description (node and link functions and connections) using the 
NEST graphical tools. Once the user has defined a simulation scenario, it is sent to the simulation server to be 
executed. One of NEST's key features is its ability to reconfigure a scenario during the simulation run. The user may 
delete or add nodes and links (thus failures can be emulated) or change their features while the simulation is running. 
The impact of these changes may be instantly observed and interpreted. Such dynamically reconfigured simulations 
can be used to study the impact of node/link failure and recovery on the modeled network system. 
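
The effect of deleting and restoring links during a run can be illustrated with a small reachability model. This sketch is not NEST code and does not use the NEST interface (NEST models are typically coded in C); the `Network` class and its methods are invented purely to show how link failure and recovery change what the modeled system can do.

```python
from collections import deque


class Network:
    """Undirected network; links are unordered pairs of distinct node names."""

    def __init__(self, links):
        self.links = {frozenset(link) for link in links}

    def fail_link(self, a, b):
        self.links.discard(frozenset((a, b)))   # emulate a link failure

    def repair_link(self, a, b):
        self.links.add(frozenset((a, b)))       # emulate recovery

    def reachable(self, src, dst):
        """Breadth-first search over the links that are currently up."""
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                return True
            for link in self.links:
                if node in link:
                    (other,) = link - {node}
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
        return False


net = Network([("A", "B"), ("B", "C"), ("A", "C")])
```

Failing the A-C link leaves C reachable from A through B; failing B-C as well partitions the network until one of the links is repaired.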

3.3.2. DEPEND — A System Dependability Analysis Environment 

DEPEND is an integrated design and fault injection environment [Goswami92]. It provides facilities to rapidly 
model fault tolerant architectures and conduct extensive fault injection studies. It is ideally suited for evaluating 
specific fault tolerant mechanisms, detailed fault scenarios such as latent errors, and software behavior due to 
hardware faults. It is a functional, process-based [Kobayashi78], [Schwetman86] simulation tool. The system behavior is 
described by a collection of processes that interact with one another. A process-based approach was selected for 
several reasons. It is an effective way to model system behavior, repair schemes, and system software in detail. It 
facilitates modeling of inter-component dependencies, especially when the system is large and the dependencies are 
complex, and it allows actual programs to be executed within the simulation environment. Both hierarchical and hybrid 
simulation techniques have been used in DEPEND. 

DEPEND exploits the properties of the object-oriented paradigm, specifically, modular decomposition and 
modular composability [Meyer88], to model different levels of components and to implement a variety of fault models. 
Modular decomposition consists of breaking down a problem into small elements, whereas modular composability 
favors production of elements that can be freely combined with each other to provide new functionality. If, for 
instance, the fault injection process is divided into two elements or objects: an object that determines when to inject 
and interrupt the system, and an object that determines the response to a fault (the fault model), then the two criteria 
are met. The first object is common to all fault injection methods. It encapsulates the various mechanisms used to 
determine the arrival time of a fault and interrupt the system. The second object is the fault model and is specific to 
the component being injected and the type of fault injection study. The two are combined via function calls. Thus, by 
specifying different fault model objects, one injector object can be used for all types of fault injections. Key objects, 
such as the injector object, are designed to be parameterized. That is, the user can specify various fault arrival 
distributions or trace files. This same approach is used to model components that are similar but not identical; common 
aspects are encapsulated in an object which then invokes other objects to provide more specific functionality. 
Furthermore, because users can specify specific behaviors (e.g., their own fault model objects), the tool is not limited 
to any predefined set of fault models or component types. 
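
The separation between "when to inject" and "what the fault does" can be sketched as follows. This is not DEPEND code (DEPEND models are written in C++ against its own library); the `Injector` class and the two fault model functions are hypothetical and only illustrate how one parameterized injector can be combined with different fault models through function calls.

```python
import random


class Injector:
    """Decides when to inject; delegates the fault's effect to a fault model."""

    def __init__(self, arrival_time, fault_model):
        self.arrival_time = arrival_time    # callable: time until the next fault
        self.fault_model = fault_model      # callable: applies the fault

    def run(self, component, horizon):
        """Inject faults into `component` until simulated time `horizon`."""
        now, injected = 0.0, 0
        while True:
            now += self.arrival_time()
            if now > horizon:
                return injected
            self.fault_model(component)
            injected += 1


def bit_flip(memory):
    """Fault model for a memory, here a list of integer words."""
    memory[0] ^= 1


def link_down(link):
    """Fault model for a communication link, here a dict with an 'up' flag."""
    link["up"] = False


# The same injector class with two different fault models and arrival processes.
rng = random.Random(1)
memory_injector = Injector(lambda: 10.0, bit_flip)                 # constant time
link_injector = Injector(lambda: rng.expovariate(0.1), link_down)  # exponential
```

Swapping the `fault_model` or `arrival_time` argument changes the experiment without touching the injector itself, which is the point of the decomposition described above.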

A library of objects that provide the skeletal foundation necessary to model an architecture and conduct simu- 
lated fault-injection experiments is provided. This reduces the development time and effort needed to build simulation 
models. In addition to decomposition, composition, and parameterization, the concept of inheritance [Meyer88] 
makes it possible to provide a library with a minimum set of objects that can be readily specialized to model a wide 
gamut of different architectures and fault injection experiments. With inheritance, users can inherit the properties of 

Table 3.4. Some Objects Provided in DEPEND 

Name           Type        Description 
-------------  ----------  -------------------------------------------------------------------- 
Active_elem    Elementary  Simulates a basic server. Several disciplines: first come 
                           first serve, round robin, etc. 
Injector       Elementary  Injects faults using distributions, trace files, and workloads. 
Checksum       Elementary  Computes checksums. 
Fault Report   Elementary  Compiles and displays fault statistics. 
Voter          Elementary  Simulates a basic voter with timeout. 
Server         Complex     Simulates a server with spares. Three policies: no spare, graceful 
                           degradation, stand-by sparing. Automatic repair and reconfiguration. 
Link           Complex     Simulates communication channels. Several fault types: link dead, 
                           packet corruption, packet loss, and user-defined faults. 
NMR            Complex     Simulates dual self-checking, triple-modular redundant, and 
                           N-modular redundant components. 
Fault Manager  Complex     Simulates software fault management schemes. Logs faults 
                           and shuts off components which exceed their fault threshold. 

an existing object and develop more specialized objects with minimum effort. Table 3.4 briefly describes some of the 
major objects in the DEPEND library. Elementary objects provide basic functions, such as injecting faults and com- 
piling statistics. Complex objects created from several elementary objects simulate fundamental components found in 
most fault tolerant architectures such as CPUs, self-checking processors, N-modular redundant processors, communi- 
cation links, voters, and memory. 

The steps required to develop and execute a model are shown in Figure 3.10. The user writes a control program 
in C++ using the objects in the DEPEND library. The program is then compiled and linked with the DEPEND 
objects and the run-time environment. The model is executed in the simulated parallel run-time environment. Here, 
the assortment of objects including the fault injectors, CPUs, and communication links execute simultaneously to sim- 
ulate the functional behavior of the architecture. Faults are injected and repairs are initiated according to the user's 
specifications, and a report containing the essential statistics of the simulation is produced. 

DEPEND allows users to specify different fault models. In addition, DEPEND provides default fault routines 
for each object to minimize user design time. For instance, the default fault model for a communication medium sim- 
ulates the effects of a noisy communication channel. Fields in the messages passed along the communication link are 
actually corrupted or the message is destroyed. 

Figure 3.10. The DEPEND Environment 



The fault injector is a fundamental object of DEPEND (Figure 3.11). It encapsulates the mechanism for inject- 
ing faults. To use the injector, a user specifies the number of components, the TTF distribution for each component, 
and the fault subroutine that specifies the fault model. In addition to user-specified distributions, the injector provides 
constant time, exponential, hyperexponential, and the Weibull distributions. The injector also provides a workload- 
based injection scheme that varies the fault arrival rate based on a specified workload. The user provides a workload 
function, a set of workload states, and an exponential fault arrival rate for each state. For example, the workload func- 
tion may be the utilization of a server. With this approach, the fault arrival rate will fluctuate with the utilization of the 
server. The fault injector will periodically poll the workload function to update a state transition diagram to maintain 
a history of the workload behavior. This history is used to inject a large number of faults during peak workload condi- 
tions and fewer faults when the workload is low. This technique models the workload/failure dependency observed in 
[Iyer80] and [Castillo81]. 
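
A workload-based injection scheme of this kind can be sketched as below. The sketch simplifies the description above: it uses only the workload state observed at each poll rather than a maintained state transition history, and the function name, parameters, and threshold-based state mapping are assumptions made for illustration, not DEPEND's interface.

```python
import math
import random


def workload_based_injections(utilization, thresholds, rates, dt, rng):
    """Decide at which polls a fault is injected.

    utilization: workload-function samples, one per poll (polled every dt).
    thresholds:  ascending utilization bounds that separate workload states.
    rates:       exponential fault arrival rate per state
                 (len(thresholds) + 1 entries).
    Returns the indices of the polls at which a fault is injected.
    """
    injections = []
    for i, u in enumerate(utilization):
        state = sum(1 for th in thresholds if u >= th)
        # Probability of at least one exponential arrival within one interval.
        p = 1.0 - math.exp(-rates[state] * dt)
        if rng.random() < p:
            injections.append(i)
    return injections
```

With a higher rate assigned to the high-utilization state, more faults are injected during peak workload periods and fewer when the workload is low.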

In addition to executing actual C++ and C programs, DEPEND provides an abstract software modeling environ- 
ment to simulate program behavior during the early design stages when actual code does not exist. The environment 
represents application programs by decomposing them into graph models consisting of a set of nodes, a set of edges 
that probabilistically determine the flow from node to node, and a mapping of the nodes to memory. The graph 

Figure 3.11. The Fault Injector Object 



models are then mapped to virtual memory and executed while errors are injected into the program’s memory space. 
The environment provides application-dependent parameters, such as detection and propagation times, and permits 
meaningful application-dependent evaluation of function- and system-level error detection and recovery schemes. 
This environment has been used to analyze memory-scrubbing schemes within the context of application programs 
[Goswami93c]. The application-dependent coverage values obtained were compared with those obtained by tradi- 
tional schemes that assume uniform or random memory access patterns. The coverage values obtained using the tradi- 
tional approach were found to be up to 100% larger than those obtained with the software graph model. The findings 
demonstrate the need for application-dependent evaluation — especially when evaluating the dependability of applica- 
tion-specific systems. 

DEPEND has been applied to evaluate several computer systems. In [Goswami91] and [Goswami92], 
DEPEND was used to simulate the UNIX-based Tandem Integrity S2 fault tolerant system and evaluate how well it 
handles near-coincident errors caused by correlated and latent faults. Issues such as memory scrubbing, reintegration 
policies, and workload-dependent repair time were evaluated. The accuracy of the simulation model was validated by 
comparing the results of the simulations with measurements obtained from fault injection experiments conducted on a 
production Tandem Integrity S2 machine. DEPEND has also been used to study the CM5 connection machine, the 
Parsytec high-performance computer being developed by the European Esprit project, the Space Station Data Manage- 
ment System, and the computing element of the Hubble Telescope. 





IV. PROTOTYPE PHASE 


In the prototype phase of the development of fault tolerant systems, physical fault injection can be used to eval- 
uate fault, error, failure, and fault tolerance characteristics of the developed systems. Normally, fault injection can be 
applied only to fault tolerant systems, because the injected faults, if activated, would almost always crash the target 
system without fault tolerant mechanisms. However, fault injection can also be used in non-fault-tolerant systems if 
the system control flow can be well traced and the system state information can be obtained when it crashes because 
of injected faults. 

A fault injection environment typically contains the following components: target system, controller and moni- 
tor, fault injector, data collector, and data analyzer, as shown in Figure 4.1. The target can be a VLSI chip, a computer 
system, or a network system. When faults are injected into the target, either benchmarks or synthetic workloads 
should be running on the target to emulate real workloads. The controller is a special software program, sitting on the 
target or on another computer, which controls the overall fault injection experiments. The fault injector implements 
fault injections into the target. The monitor keeps track of normal and abnormal executions of the workload and initi- 
ates data collection whenever necessary. The data collector and analyzer perform on-line data collection and off-line 
data processing and analysis, respectively. 


Figure 4.1. Components in a Fault Injection Environment 

Table 4.1. Categories of Fault Injections 

Category       Hardware-Implemented         Software-Implemented                  Radiation-Induced 
-------------  ---------------------------  ------------------------------------  --------------------------- 
Approach       Inject faults to IC pins     Inject faults to components           Inject faults by applying 
               by hardware instrumentation  by special software                   radiation rays to target 

Advantages     No disturbance to workload   Flexibility                           Can induce transient faults 
               High time resolution         Low cost                              inside IC evenly 

Disadvantages  Limited access points        Workload disturbance                  Fault injection points 
               High cost                    Low time resolution                   are uncontrollable 

Studies        FTMP [Lala83]                Accelerated Injection [Chillarege89]  Z-80 [Cusick85] 
               FTMP [Shin84], [Shin86]      FIAT [Segall88], [Barton90]           MC6809E [Karlsson89] 
               FTMP [Finelli87]             FERRARI [Kanawati92]                  MC6809E [Gunneflo89] 
               MESSALINE [Arlat90]          HYBRID [Young92] 
                                            FINE [Kao93] 


The fault injector can be implemented by hardware, software, or radiation. Correspondingly, fault injection can 
generally be divided into three categories: hardware-implemented (or hardware) fault injection, software-implemented 
(or software) fault injection, and radiation-induced (or radiation) fault injection. Table 4.1 lists features and 
representative studies in these categories. The monitor can also be implemented by hardware, software, or both (hybrid). 
If the fault injector is implemented by software and the monitor is implemented by hardware or by both hardware and 
software, the system is called a hybrid fault injection environment. The following three sections discuss in detail each 
type of fault injection. 


4.1. Hardware-Implemented Fault Injection 

Hardware-implemented fault injection is a method of introducing faults in the hardware of a computer system 
with the aid of additional hardware instrumentation. The method is well suited for studying dependability 
characteristics which require high time resolution, such as fault latency in the CPU, which cannot be easily achieved by 
other fault injection methods. For example, the occurrence of software-implemented faults is restricted by the system 
clock (i.e., the injections must occur synchronously). Detections are similarly restricted by the system clock, unless an 
external hardware monitor is used. Two main techniques are used to accomplish hardware-implemented fault injections. 

The first approach involves the use of active probes attached to the desired hardware injection points. The 
currents through these injection points can be altered, thereby influencing the corresponding logic values. The types of 
faults attainable with probes are usually limited to stuck-at faults. However, it is also possible to introduce bridging 
faults by placing the probes across multiple hardware points. Care must be taken with the use of active probes that 
force values onto injection points, because damage to the target hardware can result due to an inordinate amount of 
current. 


The second technique involves the insertion of additional hardware into the target computer system. Whereas
the first method uses active probes which are external to the target system, this method introduces additional
hardware, which becomes part of the target system. The most common approach requires the interpolation of a socket
between a chip and the circuit board. This socket has the capability to inject stuck-at faults or open faults, where the
chip pin is essentially tri-stated. In addition, more complex logical faults can be forced onto these pins. For instance,
the pin signals can be inverted, or ANDed or ORed with adjacent pin signals or even with previous signals on the
same pin.
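The pin-level fault types listed above can be illustrated with a small behavioral sketch. It is only an illustration in
Python, not a description of any real injection socket: the function names, the use of None for a tri-stated pin, and the
choice of OR for the "previous signal" fault are assumptions made for this example.

```python
# Behavioral sketch of pin-level fault models (all names are hypothetical).
# Each fault maps the fault-free pin value (0 or 1) to the value actually seen.
# `neighbor` is the value of an adjacent pin in the same cycle; `previous` is
# this pin's fault-free value one cycle earlier. None models a tri-stated pin.

def stuck_at(value):
    """Return a fault that forces the pin to a constant value."""
    return lambda pin, neighbor, previous: value

def open_fault(pin, neighbor, previous):
    """Open fault: the pin is tri-stated, so no driven value is observed."""
    return None

def invert(pin, neighbor, previous):
    return 1 - pin

def and_adjacent(pin, neighbor, previous):
    return pin & neighbor

def or_adjacent(pin, neighbor, previous):
    return pin | neighbor

def or_previous(pin, neighbor, previous):
    return pin | previous

def apply_fault(fault, signal, neighbor_signal):
    """Apply a fault model to a sequence of per-cycle pin values."""
    faulty = []
    previous = 0  # assumed reset value of the pin before the first cycle
    for pin, neighbor in zip(signal, neighbor_signal):
        faulty.append(fault(pin, neighbor, previous))
        previous = pin
    return faulty
```

For example, apply_fault(invert, [1, 0, 1, 1], [0, 0, 0, 0]) yields the complemented sequence, while stuck_at(0)
yields all zeros regardless of the input.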


In theory, the domain of possible injection locations is limited only by the physical constraints of the target
system that prevent the introduction of probes or other hardware. Since the system is usually a complete prototype
computer system, fault injection below the chip pin level is impractical. Thus, the focus of most injections is the
pins of chips. In addition, active probes can be attached to certain circuit board locations, such as buses or other
signal lines.


In addition to the range of possible injection locations, a major concern in any fault injection environment is the
types of fault models that can be injected. Besides the logical fault types achievable with probes and sockets, an
important attribute is the duration of the fault, which can be either permanent, transient, or intermittent. Permanent
faults simply remain in place once injected. Transient faults are injected for an active period, after which they are
removed. Thus, the possibility exists that the transient fault may never even be latched by a chip (i.e., the fault
never produces an error), especially if the active period is less than the system clock period on a synchronous
machine. Intermittent faults are injected in the same manner as transient faults, but they are also repeated, either
randomly or according to some function. Both injection methods discussed previously are capable of creating any of
the three temporal fault types.

In the following, we will discuss two representative hardware-implemented fault injection environments: FTMP
[Lala83] and MESSALINE [Arlat89].

4.1.1. FTMP 

Several studies in this area centered around the fault tolerant multiprocessor (FTMP) fault injection
instrumentation [Lala83], [Shin86], [Finelli87]. FTMP is a computer architecture which evolved over a 10-year period in
connection with several critical aerospace applications [Hopkins78]. The architecture was designed to have a failure rate
on the order of 10^-10 per hour. The basic blocks of the architecture are independent processor-cache modules and
memory modules which communicate through redundant buses. The modules are dynamically grouped into several 
TMR triads or assigned to spare status. Jobs can be scheduled to any processor triad. All transactions between proces- 
sor modules and memory modules in a triad are voted bit-by-bit. When a fault occurs, the faulty module is isolated 
and the faulty triad reconfigured. Fault detection, diagnosis, and recovery are handled in such a way that application 
programs are not involved. 

Figure 4.2 shows the diagram of the FTMP fault injection instrumentation developed at the Charles Stark 
Draper Laboratory [Lala83], [Finelli87]. In an FTMP computer, there are several line replaceable units (LRUs), each
containing a processor, clock generator, power subsystem, and bus interface circuits. LRU #3 is constructed for
connection of the fault injector. All chips in LRU #3 are connected to sockets which allow them to be removed for
insertion of the fault injection implant. Each fault injection implant contains circuitry which can interrupt and reconnect
the pins in the sockets. Several different types of faults, such as stuck-at-0 and stuck-at-1, can be injected into the pins 
by the implants. These implants are controlled by a VAX 11/750 computer. A special version of the system configura- 
tion control (FSCC) program running in the FTMP communicates with the fault injection software (FIS) running in 
the VAX 11/750 through one of the FTMP I/O ports and a 1553/UNIBUS data link. 

Faults are normally injected on one pin at a time. When an injection occurs, the FIS program chooses a fault and
a pin, applies the fault to the pin, and records the injection time. Once the FTMP detects and identifies the fault and
reconfigures the system, it sends this information along with the time of each event back to FIS. Upon receiving the
information, FIS removes the fault by restoring the pin to its normal state and notifies the FTMP. The FTMP then puts
the victim module back into an active state and notifies FIS that it is ready for another fault injection. This process is
repeated after a random delay.

Figure 4.2. FTMP Fault Injection Environment
[Surviving figure labels: (2) FTMP acknowledges; (3) FTMP restores LRU #3; (4) Fault injected; (5) Data from FTMP;
FSCC Software]

In the experiments conducted at the Charles Stark Draper Laboratory [Lala83], a total of 21,055 faults were 
injected, and 17,418 (83%) were detected. All of the detected faults were identified correctly, and the system
subsequently recovered successfully from each of these faults by replacing the faulty module. That is, the coverage in the
FTMP was 100%, which validated the FTMP architecture and implementation. 
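As a worked illustration of the parameter and confidence interval estimation mentioned earlier in the paper, the
detected fraction reported above can be given an approximate interval. The sketch below uses the normal approximation
to the binomial distribution; that choice of method and the function name are assumptions made here, not taken from
[Lala83].

```python
import math

def proportion_ci(successes, trials, z=1.96):
    """Point estimate and normal-approximation confidence interval for a proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    p = successes / trials
    half_width = z * math.sqrt(p * (1.0 - p) / trials)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Figures quoted in the text: 21,055 faults injected, 17,418 detected.
injected, detected = 21055, 17418
p, low, high = proportion_ci(detected, injected)
print(f"detected fraction = {p:.4f}, approx. 95% CI = [{low:.4f}, {high:.4f}]")
```

With this many injections the interval is narrow (roughly half a percentage point on either side of the estimate). The
normal approximation is not appropriate for the recovery coverage itself, since the observed proportion there is
exactly 1.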

Another study using the FTMP fault injection instrumentation was reported in [Shin84], with emphasis on the 
investigation of fault latency. Results showed that the hazard rate of fault latency is monotonically decreasing. Two 
distributions with monotonically decreasing hazard rates, Weibull and gamma distributions, were then used to fit the 
experimental results. The study also investigated the effect of fault latency on the probability of having multiple faults. 
It was shown that there exists an optimal fault latency that minimizes the multiple-fault probability.
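The statement that a Weibull distribution can have a monotonically decreasing hazard rate can be checked
numerically. The sketch below evaluates the standard Weibull hazard function h(t) = (k/s)(t/s)^(k-1), with shape k and
scale s; the parameter values are arbitrary illustrations, not values fitted in [Shin84].

```python
def weibull_hazard(t, shape, scale):
    """Hazard rate of a Weibull distribution: (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

times = [0.5, 1.0, 2.0, 4.0, 8.0]

# shape < 1: the hazard rate decreases with time (the case described in the text).
decreasing = [weibull_hazard(t, shape=0.5, scale=1.0) for t in times]

# shape == 1: the Weibull reduces to the exponential, with constant hazard 1/scale.
constant = [weibull_hazard(t, shape=1.0, scale=2.0) for t in times]

print(decreasing)
print(constant)
```

A decreasing hazard rate means that the longer a fault has remained latent, the less likely it is to be detected in the
next instant.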

Later, fault injection experiments on the same instrumentation were conducted at the NASA Langley Research 
Center [Finelli87] to investigate two issues: fault sampling methods and fault recovery distributions. For each fault 
injection, two choices must be made: the fault location (pins) and the fault type (stuck-at-1, stuck-at-0, inverted signal, 
etc.). Thus, the possible fault set, or the collection of all different injected faults, can be very large. Exhaustive fault 
injection is costly and time consuming. It is necessary to find appropriate sampling methods to reduce the time and
cost of testing. The study compared the effects (detection behavior) of different faults and grouped these faults into
several subsets according to the similarity in their effects. The results showed that the effects are not homogeneous 
across the fault set. This indicates that stratified sampling methods, based on the fault subsets, should be developed for 
fault injection. The study also showed that the fault recovery time is not exponentially distributed. 
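A stratified sampling scheme of the kind suggested here can be sketched as follows. The strata names, the
(pin, fault type) tuples, and the proportional allocation rule are assumptions made for illustration; [Finelli87]
argues that stratification is needed but this sketch does not reproduce any procedure from that study.

```python
import random

def stratified_sample(strata, total, rng):
    """Draw about `total` faults, allocated to strata in proportion to their size.

    `strata` maps a subset name to the list of faults in that subset. Every
    non-empty stratum receives at least one injection so that no class of
    fault behavior is left unobserved.
    """
    population = sum(len(faults) for faults in strata.values())
    sample = {}
    for name, faults in strata.items():
        if not faults:
            sample[name] = []
            continue
        count = max(1, round(total * len(faults) / population))
        count = min(count, len(faults))
        sample[name] = rng.sample(faults, count)
    return sample

# Hypothetical fault set: (pin, fault type) pairs grouped by detection behavior.
strata = {
    "fast_detect": [(pin, "stuck-at-0") for pin in range(80)],
    "slow_detect": [(pin, "stuck-at-1") for pin in range(15)],
    "rarely_detected": [(pin, "inverted") for pin in range(5)],
}
chosen = stratified_sample(strata, total=20, rng=random.Random(0))
print({name: len(faults) for name, faults in chosen.items()})
```

Simple random sampling from the whole fault set could easily miss the small "rarely_detected" subset; stratification
guarantees that it is represented.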

4.1.2. MESSALINE 

MESSALINE [Arlat90] is a flexible, pin-level fault injection tool that has been developed at LAAS-CNRS in 
Toulouse, France. The general architecture of MESSALINE and its environment is given in Figure 4.3. The injec- 
tion, activation, and collection modules are implemented in hardware on an Intel 310 microcomputer. The software 
management module resides on a Macintosh II computer, which provides a flexible user interface. 

The fault injection mechanism for MESSALINE uses active probes and socket insertion. Thus, fault types such 
as stuck-at, open, bridging, and complex logical functions can be injected. Because the duration and frequency of 
faults can be controlled, the fault injector can introduce permanent, transient, and intermittent faults. Signals collected 
from the target system can provide feedback to the injector. Also, a device is associated with each injection point to 


Figure 4.3. General Architecture of MESSALINE 
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and affect software executions (hardware-induced software errors). These faults can occur in CPU, memory, bus, and 
networks. They may cause the system to execute incorrect instructions, access incorrect data, and produce incorrect 
results. By software faults, we mean software design/implementation defects (e.g., incorrect initialization of a
variable or failure to check a boundary condition), which may change software states to unexpected states. If software
data is corrupted by either hardware or software faults, we call the resulting corruptions software errors.

At least two issues need to be addressed for software fault injections. The first is the target of the fault
injection: when a fault is injected into a memory location or a register, which process owns the memory location, or
which process is running on the processor? The second issue is what fault models should be used to simulate
hardware and software faults. We have discussed hardware fault models at function level in Section 3.3. Like
hardware models, software models should be built based on engineering experience and field measurements.

Several fault models and implementation techniques are listed in Table 4.2. All these techniques are similar in 
that they change program or memory words. To inject software faults, the text segment needs to be modified. Some 
typical software faults are: a variable is used before it is initialized; a module's interface is defined or used incorrectly; 
statements are in the wrong order or omitted [Sullivan91]. As a result of executing faulty software code, the data seg- 
ment may be corrupted, causing software errors. Software errors can also be directly injected by changing the data 
segment. 

Table 4.2. Techniques Used for Software Fault Injection

Type              Method
Software Fault    Modify the text segment of the program.
Software Error    Modify the data segment of the program.
Memory Fault      Flip memory bits of the program.
CPU Fault         Use a trap to modify the memory area of the saved CPU registers.
Bus Fault         Use traps before and after an instruction to change the instruction or data
                  used by the instruction and then restore them after the instruction is executed.
Network Fault     Modify or delete transmission messages.

When the software approach is used to emulate hardware faults, the faults are normally of a transient nature. For
example, the faulty bits in memory or CPU registers can be overwritten by subsequent instructions. However, the
software approach can be used to emulate permanent faults by repeatedly injecting the same fault to a location
whenever there is an access to the location. For example, to emulate a permanent stuck-at-0 fault at a particular bit in a
memory word, the bit is changed to 0 after every write operation to the word. To emulate a permanent stuck-at-1 fault
at a bus address line, the corresponding bit in the effective address (in the program counter or in a CPU register) is set
to one before any access to the bus. The emulation is expensive, since it involves monitoring every access and
executing many extra instructions.
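The difference between a transient bit flip and an emulated permanent stuck-at fault can be sketched with a toy
memory model. The class and method names below are hypothetical, and a real injector would intercept writes with traps
rather than wrap memory in an object; the sketch only shows why the permanent case needs work on every write.

```python
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit memory words

class FaultyMemory:
    """Toy word-addressed memory with software-emulated faults."""

    def __init__(self, size):
        self.words = [0] * size
        self.stuck_at_0 = {}  # address -> mask of bits forced to 0
        self.stuck_at_1 = {}  # address -> mask of bits forced to 1

    def _apply_stuck_bits(self, address, word):
        word &= ~self.stuck_at_0.get(address, 0)
        word |= self.stuck_at_1.get(address, 0)
        return word & WORD_MASK

    def inject_stuck_at(self, address, bit, value):
        """Permanent fault: the bit is re-forced after every later write."""
        table = self.stuck_at_1 if value else self.stuck_at_0
        table[address] = table.get(address, 0) | (1 << bit)
        self.words[address] = self._apply_stuck_bits(address, self.words[address])

    def flip_bit(self, address, bit):
        """Transient fault: a single flip that the next write will overwrite."""
        self.words[address] ^= 1 << bit

    def write(self, address, word):
        self.words[address] = self._apply_stuck_bits(address, word & WORD_MASK)

    def read(self, address):
        return self.words[address]
```

The extra masking in every write is the software analogue of the monitoring overhead noted in the text.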

Unlike hardware-implemented fault injections, which are difficult to gear toward specific workload areas,
software fault injections can be targeted toward user applications, the operating system, or both. If the target is user
applications, the fault injector can be inserted into user applications or can be an extra layer between the user
applications and the operating system. If the target is the operating system, the fault injector has to be embedded in the
operating system, because it is very difficult to add an extra layer between the machine and the operating system.

Although the software-based approach is very flexible, it has some restrictions. First, the approach cannot inject
faults into locations not accessible to software. We have mentioned in Section 3.2 that approximately 1/3 of errors
produced in logic-level fault injections cannot be emulated through the software approach [Czeck91]. Second, the
software instrumentation may disturb the workload running in the target system and even change the structure of the
original software. A careful design can alleviate the perturbation to the workload. Another disadvantage is the low time
resolution of the approach, which may cause fidelity problems. For long latency faults, such as memory faults, the
low time resolution may not be a problem. For short latency faults, such as bus and CPU faults, the approach may
fail to capture the error behavior (e.g., propagation). This problem can be solved by using a hardware monitor, i.e., the
hybrid approach [Young92]. The hybrid approach combines the versatility of software-implemented injection and the
accuracy of hardware monitoring. It is well suited for measuring extremely short latencies.

There have been several studies using the software-based approach. In [Chillarege89], a failure acceleration
method is used to inject overlay software faults into an IBM commercial transaction processing system. In the
failure acceleration method, fault injections are designed such that the fault/error latency is decreased and the probability
of a fault causing a failure is increased. An overlay occurs when a program writes into an incorrect area. It is estimated
that about 1/3 of software errors can be mapped into the overlay model [Chillarege89]. The study quantified the
immediate impact and potential hazards (which may cause a catastrophic failure in the future) of the injected faults.

Table 4.3. Comparison of Software-Implemented Fault Injections

Tool               FIAT             FERRARI          HYBRID           FINE
                   [Segall88]       [Kanawati92]     [Young92]        [Kao93]
Hardware           PC RT            SPARC            Tandem S2        Sun
Injection target   O.S.                              O.S.             O.S.
                   User             User             User             User
Monitor            Software         Software         Hybrid           Software
Fault types        Memory           Memory           Memory           Memory
                   CPU              CPU              CPU              CPU
                   Communication    Bus              Cache            Bus
                                    Control flow                      Software
To evaluate        Detection        Detection        Detection        Detection
                   Latency          Latency          Latency          Propagation
                   Recovery                          Recovery

In recent years, interest in developing software-implemented fault injection tools has increased. Several
environments have been published in the literature: FIAT [Segall88], FERRARI [Kanawati92], HYBRID [Young92], and
FINE [Kao93]. Table 4.3 lists features of these tools, which will be discussed in the following subsections.


4.2.1. FIAT 

A number of fault injection studies at Carnegie Mellon University centered around FIAT (Fault Injection
Automated Testing), a software-implemented fault injection environment [Segall88], [Barton90], [Czeck91]. The FIAT
hardware implementation consists of IBM RT PCs connected by a token ring network. The FIAT software structure is 
divided into two parts: Fault Injection Manager (FIM) and Fault Injection REceptor (FIRE). FIM is a global control 
program responsible for all phases of the experiment. FIRE, under the control of FIM, collects the experimental 
results and sends appropriate information to FIM for off-line analysis. Figure 4.4 shows the process of a typical fault 
injection experiment. 

Figure 4.4. Typical Fault Injection Experiment in FIAT

FIAT has been used to study the impact of faults at the application workload level [Barton90]. Two
representative programs, a matrix multiplication task and a selection sort task, were chosen as application workloads. To
achieve fault tolerance, each task is executed on two different processors and the results are compared. Three fault types
were injected in the experiment: zero-a-byte, set-a-byte, and two-bit compensating. The zero-a-byte or set-a-byte fault
sets 8 consecutive bits anywhere within a 32-bit word to zero or one. The two-bit compensating fault complements 2 bits
in a word such that the parity code would not detect it as an error. Faults were injected into all locations within a
workload, with a total of over 130,000 faults injected.
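The three fault types can be written as bit manipulations on a 32-bit word. This is a minimal sketch with
hypothetical function names, not FIAT code; it also checks the property that makes the two-bit compensating fault
invisible to a single parity bit.

```python
WORD_MASK = 0xFFFFFFFF  # 32-bit word, as described in the text

def _check_start(start_bit):
    if not 0 <= start_bit <= 24:
        raise ValueError("the 8-bit window must lie inside the 32-bit word")

def zero_a_byte(word, start_bit):
    """Clear 8 consecutive bits beginning at start_bit."""
    _check_start(start_bit)
    return word & ~(0xFF << start_bit) & WORD_MASK

def set_a_byte(word, start_bit):
    """Set 8 consecutive bits beginning at start_bit."""
    _check_start(start_bit)
    return (word | (0xFF << start_bit)) & WORD_MASK

def two_bit_compensating(word, bit_a, bit_b):
    """Complement two distinct bits; the word parity is unchanged."""
    if bit_a == bit_b:
        raise ValueError("the two bits must be distinct")
    return (word ^ (1 << bit_a) ^ (1 << bit_b)) & WORD_MASK

def parity(word):
    """Parity of the word: 1 if the number of set bits is odd, else 0."""
    return bin(word).count("1") % 2
```

Complementing two bits changes the number of set bits by -2, 0, or +2, so parity(two_bit_compensating(w, a, b))
always equals parity(w) even though the word itself has changed.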

Results showed that there are a limited number of system-level fault manifestations. The mean error detection 
coverage for different workloads and fault types is around 50% to 60%. Error detection latency was found to follow a
normal distribution. This result conflicts with those presented in [Shin86], [Finelli87], where the latency was shown
to follow either gamma, Weibull, or log-normal distributions. This difference may be explained by the differences in
the experimental environment and detection mechanisms. In [Shin86], [Finelli87], the hardware-implemented fault
injection technique is used, and the resolution of detection time is on the order of milliseconds, while the time
resolution of the software-implemented FIAT is on the order of seconds, which may skew the results.

4.2.2. FERRARI


FERRARI (Fault and ERRor Automatic Real-time Injector), another software-implemented fault injection
environment, was recently developed at the University of Texas at Austin [Kanawati92]. The purpose of the development
of FERRARI was to evaluate complex systems by emulating most hardware faults in software. It was implemented on
SPARC workstations in an X-window environment. FERRARI consists of four software modules: 1) the initializer
and activator, 2) the user information, 3) the fault and error injector, and 4) the collector and analyzer. These four
modules are controlled by the manager module, which coordinates their operation.

The initialization and activation module prepares the target program for fault injection by extracting its informa- 
tion, such as the starting address, the program size, and the execution time. The user information module receives 
experiment parameters provided by the user, such as experiment mode, fault and error types, and dependability mea- 
sures to obtain. The fault and error injection module is responsible for injecting different types of transient or perma- 
nent faults, such as address line fault, data line fault, and fault in condition code flags. The data collection and analy- 
sis module records experiment results, such as information about error detection, error latency, and failures, and it 
determines statistics of these measures at the end of the experiment. 

To demonstrate the capabilities of FERRARI and to study the behavior of the target system under faulty condi- 
tions, over 600,000 fault injection runs were conducted on SUN4 SPARC workstations under different applications. 
Results showed that the error coverage is highly dependent on the fault type. The highest coverage was obtained when 
errors were injected in the task memory image. This is because the injected errors are likely to be exercised repeatedly 
if the corrupted instructions are in a loop. An important finding is that a considerable number of undetected errors are 
those that corrupted input/output routines and system libraries. These routines tend to be ignored when error
detection techniques are embedded in the user code.

4.2.3. HYBRID 

A major drawback of the above purely software-implemented fault injection environments is the low resolution
of detection time. If the error detection mechanism is implemented with hardware, the time resolution is greatly
enhanced. This approach is used in the hybrid fault injection environment developed at the University of Illinois at
Urbana-Champaign [Young92]. The hybrid environment combines the versatility of software injection and the
accuracy of hardware monitoring. It is well suited for measuring extremely short error latencies, and the introduced
overhead is minimal so that error propagation and control flow are not significantly affected by the presence of
instrumentation.


In the hybrid environment, faults are injected via software, and the impact is measured by both software and 
hardware. Figure 4.5 illustrates the subsystems that make up the environment. It consists of a fault injection system, 
a hybrid monitor system to measure the effects of injected faults, and a supervisory system to automate the measure- 
ments. The hybrid monitor system is further divided into a hardware monitor and a software monitor. Figure 4.6 
illustrates how these systems are physically situated. The fault injector and software monitor execute on the test
system, while the supervisor program executes on the control host. Probes attach the hardware monitor to the
address/data backplane of the test system so that the monitor can analyze and record the signals generated.
Communication between the supervisor and the hardware monitor takes place over an RS-232 or GPIB connection.


Figure 4.5. Hybrid Fault Injection Environment

Figure 4.6. Physical Layout of Hybrid Fault Injection System

The function of the environment is to perform experiments that repeatedly inject faults and record observations.
The environment introduces faults into the test system during the execution of a target program, measures the effects 
of that fault, and returns the test system to conditions present prior to fault injection. These operations form a single 
observation loop. Faults can be injected into any location that has a physical address, e.g., CPU registers, cache, local 
memory, mass storage, and network controllers. Faults can also be injected into locations allocated to a single, exe- 
cuting user program or even into the kernel, and propagation can be characterized down to the instruction level. 
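The observation loop described above (inject, measure, restore) can be sketched in a few lines. Everything in this
example is a stand-in: the "test system" is a dictionary of 32-bit values, the target program is an ordinary function,
and the only measurement is whether the result differs from a fault-free run. No hardware monitoring is modeled.

```python
import copy
import random

def observation_loop(initial_state, run_target, locations, runs, rng):
    """Repeat: restore the pre-injection state, inject one bit flip, run, record."""
    golden = run_target(copy.deepcopy(initial_state))  # fault-free reference result
    observations = []
    for _ in range(runs):
        state = copy.deepcopy(initial_state)  # conditions present prior to injection
        location = rng.choice(locations)
        bit = rng.randrange(32)
        state[location] ^= 1 << bit           # inject a single bit flip
        result = run_target(state)            # execute the target program
        observations.append((location, bit, result != golden))
    return observations

# Hypothetical target: it reads "a" and "b" but never touches "unused".
def target_program(state):
    return state["a"] + state["b"]

start = {"a": 10, "b": 32, "unused": 7}
log = observation_loop(start, target_program, ["a", "b", "unused"], 200, random.Random(0))
manifested = sum(1 for _, _, failed in log if failed)
print(f"{manifested} of {len(log)} injected faults changed the result")
```

Faults injected into the location the program never reads do not manifest, which is the software analogue of a fault
that is never activated by the workload.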

The fault injection environment was used to study dependability characteristics of a Tandem Integrity S2 fault
tolerant computer system [Jewett91]. High degrees of accuracy in measuring latency (within 20 ns) were obtained.
Measurements of the sensitivity of different instructions to faults indicated a 5% chance that a faulted MIPS RISC
instruction will not fail when executed. Modeling of multi-level error propagation showed that error detections were
due to multiple corruptions of state in as many as 57% of reads from wrong addresses and 37% of writes to wrong
addresses. The median latency associated with error detection by an individual CPU was on the order of 10 μs, and
the median delay between detection and the start of CPU shutdown was on the order of 100 ms. Kernel fault injection
studies showed that a fault in the kernel is 2.6 times as likely to bring down a CPU as a fault elsewhere.

4.2.4. FINE 

FINE is a UNIX-based fault injection environment developed at the University of Illinois at Urbana-Champaign
[Kao93]. The significance of FINE is twofold. First, it is the first tool that can inject software faults as well as
hardware errors. Second, it is the first tool built for tracing fault propagation among software modules. The software
faults that can be injected by FINE include initialization (missing or incorrect), assignment (missing or incorrect),
condition check (missing or incorrect), and function (incorrect) faults. FINE can also inject hardware errors such as
CPU (ALU, shifter, opcode decoder, or registers), memory (text segment or data segment), and bus errors (address
lines or data lines).

Figure 4.7 shows the FINE environment. FINE consists of a fault injector, a software monitor, a workload
generator, a controller, and several analysis utilities. The fault injector and software monitor are embedded in the kernel
so that faults can be injected there and their propagation can be monitored. Fault injection is implemented by
modifying the system trap handling routines, so the fault injector can be considered an extra layer between the
operating system and the machine. The software monitor traces the execution flow and key variables of the kernel.
Software probes are inserted into functions in the kernel to record the execution flow and the values of arguments and key
variables. The synthetic workload generator issues various system calls to activate injected faults. The distribution of
generated system calls can be specified by users to emulate real workloads or to deliberately accelerate the activation
of injected faults. The controller assigns experiment specifications to the fault injector and the monitor, and it initiates
experiments. The analysis utilities provide assistance in analyzing fault propagation. The target of the study is the
UNIX kernel, a non-stopped, highly parameterized, complex service program with high impact and a broad spectrum
of workloads.

Figure 4.7. The FINE Environment

Experiments on SunOS 4.1.2 (on a SPARCstation IPC) were conducted by applying FINE to investigate fault
propagation and to evaluate the impact of various types of faults. Results showed that memory faults and software
faults usually have a very long latency, while bus faults and CPU faults tend to crash the system immediately. Nearly
90% of detected errors are detected by hardware. About half (47%) of the detected errors are data errors; these data
errors are detected when the system tries to access an area it has no privilege to access. In software fault
propagation, incorrect control flow is the major impact for the first level of propagation, while data corruption is the
major impact for the subsequent propagation. Analysis of fault propagation among the UNIX subsystems revealed that
only about 8% of faults propagate to other UNIX subsystems.

4.3. Radiation-Induced Fault Injection

Neither hardware-implemented nor software-implemented fault injections have a way to produce transient faults
at random locations inside ICs. Radiation-induced fault injections provide such a capability. One way to do this is to
expose the chip to the heavy-ion radiation from a Californium-252 (Cf-252) source [Gunneflo89], [Karlsson89]. The
heavy ions emitted from the source are capable of creating transient faults when they pass through a depletion region
in the IC. One advantage of this method is that it can produce transient faults at random locations evenly and can
cause either a single bit flip or multiple bit flips. This leads to large variation in the errors seen on the output pins of
the IC.

In the fault injection experiments reported in [Gunneflo89], [Karlsson89], the Cf-252 method was used to
investigate error coverage and detection latency for error detection schemes for the MC6809E 8-bit microprocessor. The
intention of the experiments was to characterize the effects of transient faults that originate inside a CPU. The 
MC6809E is fabricated in NMOS, a technology sensitive to heavy ion radiation. The error detection schemes under 
study are suitable for implementation with a watchdog processor that checks the behavior of the main processor on the 
external bus. The developed experimental system is called FIST (Fault Injection system for Study of Transient fault 
effects). Figure 4.8 shows the FIST diagram. 

Figure 4.8. FIST Diagram

The heavy-ion radiation is implemented by using a commercially available 37 x 10^3 becquerel (1 μCi) Cf-252
source. The Cf-252 source is mounted inside a vacuum chamber together with a small computer system. One of the
system boards is placed on a mechanical fixture movable in three dimensions for accurate positioning of the CPU beneath
the Cf-252 source. The system has two MC6809E CPUs which operate synchronously using the same clock. One CPU
is exposed to heavy-ion radiation. The other is used as a reference to detect errors via comparison of the outputs from
the two CPUs. When errors are detected by the comparison logic, the logic analyzer is triggered to record the external
bus signals. The monitoring computer is responsible for data acquisition and control of experiments.
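The comparison of an exposed CPU against a synchronously clocked reference can be sketched as a cycle-by-cycle
comparison of two recorded bus traces. The function name and the trace values below are made up for illustration; in
FIST the comparison is done by hardware logic, not software.

```python
def first_mismatch(reference_bus, exposed_bus):
    """Return (cycle, differing_bit_count) for the first cycle in which the two
    bus traces disagree, or None if they agree over the compared cycles."""
    for cycle, (ref, obs) in enumerate(zip(reference_bus, exposed_bus)):
        if ref != obs:
            differing_bits = bin(ref ^ obs).count("1")
            return cycle, differing_bits
    return None

# Made-up 8-bit data bus traces; the exposed CPU diverges in cycle 2.
reference = [0x3C, 0x10, 0xA5, 0x7E]
exposed = [0x3C, 0x10, 0xA4, 0x00]

print(first_mismatch(reference, exposed))
```

The differing-bit count distinguishes single-bit from multiple-bit errors on the output pins, which is one of the
quantities such experiments report.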

A fault injection experiment is conducted in the following way. Before the experiment starts, the monitoring
computer fetches from the host computer a load file which contains the test program to be executed. The test program
is then loaded from the monitoring computer to the MC6809E system. After the loading, the test program is started
with a "go" command from the monitoring computer. When a mismatch is detected, the monitoring computer fetches
the recorded error data from the logic analyzer and the error flip-flops in the MC6809E system and transfers them to
the host computer. Finally, the MC6809E system is reset, and the test program is reloaded for the next experiment.

It was found from fault injection experiments that 78% of all errors affected control flow (i.e., caused the
processor to diverge from the correct sequence) and 17% caused errors in data. Results also showed that 30% of all
errors were multiple bit errors on the output pins, although the origin of each of these errors was only one single heavy
ion. The error recordings obtained from the experiments were also used as input to simulation models of different 
error detection mechanisms to evaluate these error detection mechanisms without implementing them. The coverage 
of several detection mechanisms was investigated. It was found that the best mechanism was the one that detects 
access to the memory outside permitted areas and that the combination of two mechanisms gave a better coverage 
than any one mechanism alone. It was also found that the type of the test program had a considerable influence on the 
results of error detection mechanisms.

V. OPERATIONAL PHASE

Figure 5.1. Measurement-Based Analysis (Step 1 through Step 4)


models to obtain some other measures (such as reliability and transient reward rates). Dependability and performance 
modeling and evaluation tools such as SHARPE [Sahner87] can be used in this step. The most creative part is step 4, 
the human analysis of models and measures obtained from data. New results are produced in this phase. For example, 
reliability bottlenecks can be identified from analysis of error/failure statistics, and workload/failure dependency can
be concluded by analysis of models. However, analysis methods may vary significantly from one study to another,
depending on research goals. 

Measurement-based dependability analysis of operational systems has evolved significantly over the past 15 
years. These studies addressed one or more of the following issues: basic error characteristics, dependency analysis, 
modeling and evaluation, software dependability, and fault diagnosis. The following paragraphs give a brief overview 
of these studies, which are listed in Table 5.1. 

Early studies in this field investigated transient errors in DEC computer systems and found that more than 95%
of all detected errors are intermittent or transient errors [Siewiorek78], [McConnel79]. The studies also showed that
the inter-arrival time of transient errors follows a Weibull distribution with a decreasing error rate. This distribution
was later shown to fit the software failure data collected from an IBM operating system [Iyer85b]. A recent study of
failure data from three different operating systems showed that TTE (time to error) can be represented by a
multi-stage gamma distribution for the measured single-machine operating system and by hyperexponential distributions
for the measured distributed operating systems [Lee93a].
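For readers unfamiliar with the hyperexponential distribution, the sketch below samples from a two-branch
hyperexponential and compares the sample mean with the analytic mean, which is the sum of p_i / rate_i over the
branches. The branch probabilities and rates are arbitrary illustrations and are not values measured in [Lee93a].

```python
import random

def hyperexponential_mean(probs, rates):
    """Analytic mean of a hyperexponential distribution: sum of p_i / rate_i."""
    return sum(p / r for p, r in zip(probs, rates))

def sample_hyperexponential(probs, rates, rng):
    """Pick an exponential branch with probability p_i, then sample from it."""
    rate = rng.choices(rates, weights=probs, k=1)[0]
    return rng.expovariate(rate)

# Arbitrary example: most errors arrive quickly, a few after long quiet periods.
probs, rates = [0.9, 0.1], [1.0, 0.1]
rng = random.Random(1)
samples = [sample_hyperexponential(probs, rates, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)

print(f"analytic mean = {hyperexponential_mean(probs, rates):.3f}")
print(f"sample mean   = {sample_mean:.3f}")
```

A mixture of this kind has more variability than a single exponential with the same mean, which is why it is a
natural candidate when times to error show a mix of short bursts and long gaps.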

Studies of dependency between workload and failure in the early 1980s, based on measurements from IBM
[Butner80] and DEC [Castillo81] machines, revealed that the average system failure rate is strongly correlated with the
average workload on the system. The effect of workload-imposed stress on software was investigated in [Castillo82]
and [Iyer85b]. Recent analyses of DEC [Tang90], [Wein90] and Tandem [Lee91] multicomputer systems showed that
correlated failures across processors are not negligible, and their impact on availability and reliability is significant
[Dugan91], [Tang91], [Tang92a].

Table 5.1. Measurement-Based Studies of Computer System Dependability

Category          Issues                                           Studies
Data              Analysis of time-based tuples                    [Tsao83], [Hansen92]
Coalescing        Clustering based on type and time                [Iyer86], [Lee91], [Tang93a]

Basic             Transient faults/errors                          [Siewiorek78], [McConnel79], [Iyer86]
Error             Error/failure bursts                             [Iyer86], [Hsueh87], [Tang93a]
Characteristics   TTE/TTF distributions                            [McConnel79], [Iyer85b], [Lee93a]

Dependency        Hardware failure/workload dependency             [Butner80], [Castillo81], [Iyer82a]
Analysis          Software failure/workload dependency             [Castillo82], [Iyer85b], [Mourad87]
                  Correlated failures and impact                   [Tang90], [Wein90], [Tang92a]
                  Two-way and multi-way failure dependency         [Dugan91], [Lee91], [Tang91]

Modeling          Performability model for single machine          [Hsueh88]
and               Markov reward model for distributed system       [Tang93a]
Evaluation        Two-level models for operating systems           [Lee93a]

Software          Error recovery                                   [Velardi84], [Hsueh87]
Dependability     Hardware-related & correlated software errors    [Iyer85a], [Tang92b], [Lee93a]
                  Software fault tolerance                         [Gray90], [Lee92], [Lee93b]
                  Software defect classification                   [Sullivan91], [Sullivan92]

Fault             Heuristic trend analysis                         [Tsao83], [Lin90]
Diagnosis         Statistical analysis of symptoms                 [Iyer90]
                  Network fault signature                          [Maxion90a], [Maxion90b]

In [Hsueh88], analytical modeling and measurements were combined to develop measurement-based
reliability/performability models using data collected from an IBM mainframe. The results showed that a semi-Markov
process is better than a Markov process for modeling system behavior. Markov reward modeling techniques were further
applied to distributed systems [Tang93a] and fault tolerant systems [Lee92], to quantify performance loss due to
errors/failures for both hardware and software. A census of Tandem system availability indicated that software faults
are the major source of system outages in the measured fault tolerant systems [Gray90]. Analyses of field data from
different software systems investigated several dependability issues including the effectiveness of error recovery
[Velardi84], hardware-related software errors [Iyer85a], correlated software errors in distributed systems [Tang92b],
software fault tolerance [Lee92], [Lee93b], and software defect classification [Sullivan91], [Sullivan92].
Measurement-based fault diagnosis and failure prediction issues were investigated in [Tsao83], [Iyer90], [Lin90],
[Maxion90a], [Maxion90b].

In the following subsections, we discuss issues and representative studies involved in measurements, data
processing, preliminary analysis of data, dependency analysis, modeling and evaluation, software dependability, and
fault diagnosis.

5.1. Measurements 

There are numerous theoretical and practical difficulties associated with making measurements. The question of
what and how to measure is a difficult one. A combination of installed and custom instrumentation has been used in
most studies. From a statistical point of view, sound evaluations require a considerable amount of data. In modern
computer systems, especially in fault tolerant systems, failures are rare. To obtain meaningful data for such
systems, measurements must be made over a considerably long period of time, or sometimes the measured system must be
exposed to high-stress conditions.

In an operational system, only detected errors can be measured, because an error is known only if it is detected. 
There are basically two ways to make measurements: on-line automatic logging and human manual logging. Many 
large computer systems such as IBM and DEC mainframes provide error-logging software in the operating system. 
The software records error reports from different subsystems, such as memory or disk subsystems, and other system 
events, such as reboots and shutdowns. The reports usually include information about the location, time, and type of 
the error, the system state at the error time, and sometimes error recovery (e.g., retry) information. The recorded 
reports are stored in a permanent system file chronologically. The main advantage of the on-line automatic logging is 
its ability to record a large amount of information about transient errors and to provide details of automatic error 
recovery processes, which cannot be done manually. Disadvantages are that information can be lost when a system 
fails too quickly for error messages to be recorded, and that an on-line log does not include information about the
cause and propagation of the error or about off-line diagnosis. 
Table 5.2. A Sample of Extracted Error Logs from a VAXcluster†

System ID  Logging Time             Subsystem & Unit  Interpretation
---------  -----------------------  ----------------  -----------------------------------------------
Earth      20-DEC-1987 20:23:13.22  I/O, H0SDUA51:    Disk drive error
Earth       4-JAN-1988 11:45:07.12  I/O, H3SMUA1:     Tape drive error
Europa      8-JAN-1988 14:14:28.63  CI, EURSPAA0:     Path #0 went from good to bad
Europa      8-JAN-1988 16:23:17.41  CI, EURSPAA0:     Error logging datagram received
Europa     19-JAN-1988 17:31:30.74  CI, EURSPAA0:     Virtual circuit timeout
Mercury    24-DEC-1987 04:54:52.06  Memory, TR #2     Corrected memory error
Jupiter     1-APR-1988 09:57:39.40  Unknown Device
Jupiter    16-MAY-1988 13:37:04.97  CPU, SBI          Unexpected read data fault
Mars       25-FEB-1988 02:13:20.25  CPU, IBOX         Machine check
Mars       18-APR-1988 16:46:39.75  BugCheck          Bad memory deallocation request size or address
Mars       14-MAY-1988 20:57:46.48  BugCheck          Insufficient nonpaged pool to remaster locks
Saturn     20-JUL-1988 18:51:49.15  BugCheck          Unexpected system service exception

Entry numbers, in extracted order (one of the twelve values is missing, so they cannot be aligned to rows):
7005, 12979, 13005, 13734, 3260, 10939, 14209, 13941, 20937, 27958, 37790

† The sample is intended to illustrate the different types of errors logged. Therefore, the entry numbers are not consecutive.

Since the information provided by on-line error logs may not be complete, it is valuable to have operator logs
to supply the missing information. If possible, measurements should include both on-line logs and operator logs.
A good operator log should include information about failure diagnosis, component replacement,
5.2. Data Processing

Usually, on-line logs contain a large amount of redundant and irrelevant information in various formats. Thus,
data processing must be performed to obtain useful, classified information and put it into a flat format that will facili-
Table 5.3. Major Error Types in VAXcluster

System    Type     Description
--------  -------  -----------------------------------------------------------
Hardware  CPU      CPU or bus controller errors
          Memory   Memory ECC errors
          Disk     Disk, drive, and controller errors
          Network  Local network and controller errors
Software  Control  Problems involving program flow control or synchronization
          Memory   Problems referring to memory management or usage
          I/O      Inconsistent conditions detected by I/O management routines

error types, such as CPU, memory, and disk errors, are seen in most systems. Table 5.3 lists an error classification
(major error types) for VAXcluster systems [Tang92b], [Tang93a].

After error classification, the remaining data processing can be broadly divided into two steps: data extraction
and data coalescing. Data extraction selects useful entries such as error and reboot reports (throwing away useless
entries such as disk volume change reports) from the log file and transforms them into a flat format. The design of the
flat format depends on the needs of the subsequent analyses. The following is a possible format:


    entry number | logging time | error type | device id. | other fields


In on-line error logs, a single fault in the system can result in many repeated error reports in a short period of 
time. To ensure that the subsequent analyses will not be biased by these repeated reports, entries which correspond to 
the same problem should be coalesced into a single event. A commonly used coalescing algorithm [Iyer86] merges
all error entries which have the same error type and occur within a ΔT interval of each other into a tuple. The
algorithm is as follows:

    IF <error type> = <type of previous error>
        AND <time away from previous error> < ΔT
    THEN <put error into the tuple being built>
    ELSE <start a new tuple>
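
The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in [Iyer86]: the names `ErrorTuple` and `coalesce` are hypothetical, each log entry is assumed to be a (logging time in seconds, error type) pair, and the entries are assumed to be sorted by time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ErrorTuple:
    """One coalesced event (hypothetical record layout)."""
    tuple_id: int
    no_entry: int
    start_time: float
    end_time: float
    err_type: str


def coalesce(entries: List[Tuple[float, str]], delta_t: float) -> List[ErrorTuple]:
    """Merge consecutive entries of the same type that are closer than delta_t."""
    tuples: List[ErrorTuple] = []
    previous = None  # (time, type) of the previous log entry
    for time, err_type in entries:
        same_type = previous is not None and err_type == previous[1]
        if same_type and time - previous[0] < delta_t:
            # Put the error into the tuple being built.
            tuples[-1].no_entry += 1
            tuples[-1].end_time = time
        else:
            # Start a new tuple.
            tuples.append(ErrorTuple(len(tuples) + 1, 1, time, time, err_type))
        previous = (time, err_type)
    return tuples


# Made-up entries: three disk reports in quick succession, a later one, then a CPU error.
sample = [(0.0, "disk"), (10.0, "disk"), (20.0, "disk"), (400.0, "disk"), (410.0, "cpu")]
tuples = coalesce(sample, delta_t=300.0)
for t in tuples:
    print(t)
```

With a 5-minute window (300 seconds), the five made-up entries collapse into three tuples: a disk tuple with three entries, a second disk tuple, and a CPU tuple.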

A tuple reflects the occurrence of one or more errors of the same type in rapid succession and can be
represented by a record containing at least the following fields [Tsao83], [Tang93b]:

(1) tuple_id: identification of the tuple

(2) no_entry: number of error entries in the tuple

(3) start_time: logging time of the first entry in the tuple

(4) end_time: logging time of the last entry in the tuple

(5) err_type: error type of the tuple

Different systems may need different time intervals in data coalescing. A recent study on this issue [Hansen92]
defined two types of mistakes that can be made in data coalescing: collision and truncation. A collision occurs when
the detection times of two faults are close enough (within ΔT) such that they are combined into a tuple. A truncation
occurs when the time between two reports caused by a single fault is greater than ΔT such that the two reports are split
into different tuples. If ΔT is large, collisions are likely to occur. If ΔT is small, truncations are likely to occur. The
study found that there is a threshold of time intervals beyond which collisions increase rapidly. Based on this
observation, the study proposed a statistical model which can be used to select an appropriate time interval to reduce
collisions. According to our experience, collision is not a big problem if the error type and device information is used
in data coalescing as shown in the above coalescing algorithm. Truncation is usually not considered to be a problem
[Hansen92]. There are techniques [Iyer90], [Lin90] which deal with this problem and which are used for fault
diagnosis and failure prediction (to be discussed in Section 5.7).

5.3. Preliminary Analysis

Once coalesced data is obtained, basic dependability characteristics of the measured system can be identified by
a preliminary statistical analysis. Commonly used measures in the analysis include error/failure frequency, TTE or
TTF distribution, and error/failure hazard rate function. In the following discussion, data from a VAXcluster system
[Tang93a] is used to illustrate analysis methods.

5.3.1. Basic Statistics 

Although it is not difficult, it is important to first obtain basic statistics such as frequency, percentage, and
probability from the measured data. These statistics provide a basic picture of the measured system. Often, dependability
bottlenecks can be identified by analyzing these statistics.

Table 5.4. Error/Failure Statistics for the VAXcluster

Category   Error                      Failure                    Recovery
           Frequency  Percentage      Frequency  Percentage      Probability
---------  ---------  --------------  ---------  --------------  -------------
I/O            25807  92.87 ± 0.30          105  42.86 ± 6.20    0.996 ± 0.001
Machine         1721   6.19 ± 0.28            5   2.04 ± 1.77    0.970 ± 0.002
Software          69   0.25 ± 0.06           62  25.31 ± 5.44    0.101 ± 0.071
Unknown          191   0.69 ± 0.10           73  29.80 ± 5.73    0.618 ± 0.069
All            27788  100.0                 245  100.0           0.991 ± 0.001

Table 5.4 shows the error/failure statistics for the measured VAXcluster. In the table, I/O errors include disk,
tape, and network errors. Machine errors include CPU and memory errors. Software errors are software-related errors.
The 95% confidence intervals for the percentage and probability estimates shown in the table are calculated using the
method discussed in Section 2.1 for estimating confidence intervals for proportions. Two bottlenecks can be identified
from the table.

First, the major error category is I/O errors (93%), i.e., errors from shared resources. This category of error has
a very high recovery probability (0.996). However, these errors still result in nearly 43% of all failures. This result
indicates that, although the system is generally robust to the impact of I/O errors, the shared resources still constitute a 
major reliability bottleneck due to the sheer number of errors. An improvement in such a system may require using an 
ultra-reliable network and a disk system to reduce the raw error rate, not just providing high recoverability. 

Secondly, although software errors constitute only a small part of all errors (0.3%), they result in significant fail- 
ures (25%). This is because software errors have a very low recovery probability (0.1). This software failure estima- 
tion is conservative because there are significant unknown failures (30%). Some of these unknown failures could be 
attributed to software problems. Thus, software-related problems are severe in the measured system. 
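
The confidence intervals quoted for Table 5.4 can be checked with a short calculation. The sketch below assumes that the method of Section 2.1 is the usual normal approximation for a proportion, p ± z·sqrt(p(1−p)/n) with z = 1.96 for a 95% interval; that assumption is ours, and `proportion_ci` is a hypothetical helper name. Under this assumption the I/O error share and the software failure share from the table are reproduced.

```python
import math


def proportion_ci(count: int, total: int, z: float = 1.96):
    """Return (estimate, half-width) of a normal-approximation CI for a proportion."""
    p = count / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, half_width


# Counts taken from Table 5.4: 25807 of 27788 errors are I/O errors,
# and 62 of 245 failures are software failures.
io_share, io_hw = proportion_ci(25807, 27788)
sw_share, sw_hw = proportion_ci(62, 245)

print(f"I/O error share:        {100 * io_share:.2f} +/- {100 * io_hw:.2f} %")
print(f"Software failure share: {100 * sw_share:.2f} +/- {100 * sw_hw:.2f} %")
```

The small number of failures (245) explains why the failure percentages carry much wider intervals than the error percentages.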

5.3.2. Empirical TTE Distributions and Hazard Rates 

TTE/TTF probability distributions and error/failure hazard rates are commonly used to investigate how errors
and failures occur across time. It is relatively easy to obtain empirical TTE/TTF distributions from data. Figure 5.2
shows the empirical TTE distribution function, f(t), for a measured VAXcluster system [Tang93a]. Notice that a
logarithmic coordinate is used for f(t) because of the large contrast between the largest and smallest values. It is seen
that about 67% of the TBEs (times between errors) are less than one minute. Most of these instances are "time between
errors of two different machines" because errors of the same type occurring within a five-minute interval of each other
on the same machine have been coalesced into a single error event. This fact implies that errors are likely to occur on
different machines in the measured system within a very short period of time.

The hazard rate characterizes error/failure intensity over time. It can be considered to be the probability
that an error (failure) will occur within the coming unit of time, given that no error (failure) has occurred since the
start of the system or the last error (failure) occurrence. The mathematical definition of the hazard rate [Ross85] is as
follows:

\[ h(t) = \frac{\Pr[\text{error in } (t, t+dt)]}{\Pr[\text{no errors in } (0, t)] \, dt} = \frac{f(t)}{1 - F(t)} \tag{5.1} \]

Figure 5.3 shows the empirical failure hazard rates computed from the VAXcluster failure data. The high hazard
rate near the origin, i.e., the high probability that a second failure will occur within a short time after a failure
occurrence, indicates that failures in the VAXcluster tend to occur in bursts. The most likely time for a second failure
is the first two hours after a failure occurrence. Failure bursts have been observed by many studies [Iyer86], [Hsueh87],
[Bishop88]. Actually, in an early study of transient errors [McConnel79], the Weibull distribution with a decreasing
failure rate identified for the interarrival time of failures caused by transient errors implied the existence of failure
bursts.
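
An empirical hazard rate of this kind can be computed directly from a list of observed times to failure. The following sketch uses a simple life-table estimate: for each time bin, the number of failures in the bin is divided by the number of intervals still "at risk" at the start of the bin and by the bin width, which is the discrete counterpart of f(t)/(1 − F(t)). The function name `empirical_hazard` and the sample data are hypothetical, and the input list is assumed to be non-empty.

```python
from typing import List, Tuple


def empirical_hazard(times: List[float], bin_width: float) -> List[Tuple[float, float]]:
    """Return (bin start, hazard estimate) pairs from observed times to failure."""
    n_bins = int(max(times) // bin_width) + 1
    counts = [0] * n_bins
    for t in times:
        counts[int(t // bin_width)] += 1

    hazards = []
    at_risk = len(times)  # observations that have not yet ended before this bin
    for k, events in enumerate(counts):
        if at_risk == 0:
            break
        hazards.append((k * bin_width, events / (at_risk * bin_width)))
        at_risk -= events
    return hazards


# Made-up times to failure in hours: most failures follow the previous one quickly.
ttf_hours = [0.5, 0.7, 1.2, 1.5, 5.0, 9.0]
hazards = empirical_hazard(ttf_hours, bin_width=2.0)
for start, h in hazards:
    print(f"[{start:4.1f}, {start + 2.0:4.1f}) h: hazard = {h:.3f} per hour")
```

A high value in the first bin followed by lower values is the signature of bursts described above. Estimates in the last bins rest on very few observations and should be read with care.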


Figure 5.2. VAXcluster Empirical TTE Distribution 
5.3.3. Analytical TTE Distributions


A realistic, analytical form of TTE distributions is essential in modeling and evaluating computer system
dependability. Often, for simplicity or due to lack of information, TTEs are assumed to be exponentially distributed
[Arlat90b], [Laprie84]. Early measurement-based studies found that the Weibull distribution with decreasing failure
rate is representative of the time between failures (TBF) in a measured DEC computer system [McConnel79] and a
measured IBM operating system [Iyer85b]. A recent comparative study of the dependability of the Tandem
GUARDIAN, VAX VMS, and IBM MVS operating systems showed that the software TTE in a single machine can be
represented by a multi-stage gamma distribution and the software TTE in multicomputers can be represented by a
hyperexponential distribution [Lee93a]. In this section, we discuss these two types of distributions.

Before presenting the analytical TTE distributions, we first explain how a TTE distribution is obtained from a 
multicomputer system, because both measured GUARDIAN and VMS were running on multicomputer systems. In 
the measured multicomputer systems, all machine members are working in a similar environment and running the 
same version of the operating system. If the whole system is treated as a single entity in which multiple instances of 
an operating system are running concurrently, then every software error on all machines can be sequentially ordered 
and a distribution can be constructed. The constructed TTE distribution reflects the software error characteristics for 
the whole system. We will call this distribution the multicomputer software TTE distribution. 


Figure 5.4. IBM MVS Software TTE Distribution
Figure 5.5. VAXcluster Software TTE Distribution
Figure 5.6. Tandem Software TTH Distribution

Figures 5.4 to 5.6 show the analytical TTE or TTH (Time To Halt) distributions fitted using SAS for the three
measured systems. None of the three empirical distributions could be fitted by a simple exponential function. The
fitting was tested using the Kolmogorov-Smirnov or Chi-square test (see Section 2.2) at a 0.05 significance level. The
two-phase hyperexponential distribution provided satisfactory fits for the VAXcluster and Tandem multicomputer
software TTE distributions. An attempt to fit the MVS TTE distribution to a phase-type exponential distribution led
to a large number of stages. As a result, a multi-stage gamma distribution was used instead. It was found that a
5-stage gamma distribution provided a satisfactory fit.

Figures 5.5 and 5.6 show that the multicomputer software TTE distribution can be modeled as a probabilistic
combination of two exponential random variables, indicating that there are two dominant error modes. The higher
error rate, λ_2, with occurrence probability α_2, captures both the error bursts (multiple errors occurring on the same
operating system within a short period of time) and concurrent errors (multiple errors on different instances of an
operating system within a short period of time) on these systems. The lower error rate, λ_1, with occurrence
probability α_1, captures regular errors and provides an inter-burst error rate.

Error bursts can be explained as repeated occurrences of the same software problem or as multiple effects of an 
intermittent hardware fault on the software. Actually, software error bursts have been observed in laboratory
experiments reported in [Bishop88]. The study showed that, if the input sequences of the software under investigation are
correlated (rather than independent), one can expect more "bunching" of failures than those predicted using a constant 
failure rate assumption. In an operating system, input sequences (user requests) are highly likely to be correlated. 
Hence, a defect area can be triggered repeatedly. 
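
The link between a two-phase hyperexponential TTE distribution and bursts can be made concrete with a small numerical sketch. A two-phase hyperexponential density is a weighted sum of two exponential densities, one per error mode. Its hazard rate starts at the weighted mean of the two rates and decays toward the lower rate, so an error is most likely shortly after the previous one. The parameter values below are hypothetical and are not taken from the measured systems; the function names are ours.

```python
import math
from typing import Sequence


def hyperexp_pdf(t: float, probs: Sequence[float], rates: Sequence[float]) -> float:
    """Density: sum of p_i * rate_i * exp(-rate_i * t)."""
    return sum(p * r * math.exp(-r * t) for p, r in zip(probs, rates))


def hyperexp_survival(t: float, probs: Sequence[float], rates: Sequence[float]) -> float:
    """Survival function 1 - F(t): sum of p_i * exp(-rate_i * t)."""
    return sum(p * math.exp(-r * t) for p, r in zip(probs, rates))


def hyperexp_hazard(t: float, probs: Sequence[float], rates: Sequence[float]) -> float:
    """Hazard rate f(t) / (1 - F(t))."""
    return hyperexp_pdf(t, probs, rates) / hyperexp_survival(t, probs, rates)


# Hypothetical parameters: a regular mode (rate 0.1 per hour, probability 0.7)
# and a burst mode (rate 5 per hour, probability 0.3).
probs = (0.7, 0.3)
rates = (0.1, 5.0)

for t in (0.0, 0.5, 2.0, 10.0):
    print(f"t = {t:5.1f} h  hazard = {hyperexp_hazard(t, probs, rates):.4f} per hour")
```

For these values the hazard falls from 1.57 per hour at t = 0 toward the regular rate of 0.1 per hour, which is the decreasing pattern that a single exponential (constant hazard) cannot represent.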

5.4. Dependency Analysis 

Many underlying dependencies exist among measured parameters and components, such as the dependency 
between workload and failure rate and the dependency among failures on different components. Understanding such 
dependency is important for improving system dependability and developing realistic models. In this regard, the 
workload/failure dependency issue was studied in the early 1980s and the correlated failure issue was investigated 
recently. 

Dependency between workload and failure was addressed in two approaches: statistical quantification of the 
dependence between workload and failure rate [Butner80], [Iyer85b] and stochastic modeling of failures as functions 
of workload [Castillo81]. Both demonstrated the strong correlation between workload and failure rate. This result
indicated that dependability models cannot be considered representative unless the system workload is taken into 
account. Based on this result, several workload-dependent analytical models have been proposed [MeyerJ88], [Aup- 
perle89], [Dunkel90]. 

Recent measurements on VAXclusters [Tang90], [Wein90] and Tandem machines [Lee91] found that correlated
failures are not negligible in distributed systems. Further studies showed that even a small correlation can have a large
impact on system dependability [Dugan91], [Tang91], [Tang92a]. It was also shown that neither traditional models
assuming failure independence nor those few models believed to take correlation into account are representative of the
actual occurrence process of correlated failures observed in the measured systems [Tang93b].

In the following three subsections, dependency analysis is illustrated through three examples: 1) using a
workload hazard model to analyze the dependency between workload and software failures in an IBM 3081 system, 2)
using the correlation analysis method to analyze the two-way dependency between errors on two different machines
in a VAXcluster system, and 3) using the factor analysis method to analyze the multi-way dependency among failures
on multiple processors in a Tandem fault tolerant system.


5.4.1. Workload/Failure Dependency 

An early study [Castillo81] introduced a workload-dependent cyclostationary model to characterize system
failure processes. The basic assumption is that the instantaneous failure rate of a system resource can be approximated by
a function of the usage of the resource considered. The model was applied to a PDP-10 machine running a modified
version of the standard TOPS-10 operating system. It was shown that the TTF distribution predicted by the model and
the one observed from the real system agree extremely well.

In [Iyer82a], a load hazard model was introduced to measure the risk of a failure as the system activity
increases. The proposed model is similar to the hazard rate defined in Eq. (5.1). Given a workload variable X, the load
hazard is defined as

\[ z(x) = \frac{\Pr[\text{failure in load interval } (x, x+\Delta x)]}{\Pr[\text{no failure in load interval } (0, x)] \, \Delta x} = \frac{g(x)}{1 - G(x)} \tag{5.2} \]

where g(x) is the p.d.f. of the variable "a failure occurs at a given workload value x" and G(x) is the corresponding
c.d.f. That is,

\[ g(x) = \Pr[\text{failure occurs} \mid X = x] = \frac{f(x)}{l(x)} \tag{5.3} \]

where l(x) is simply the p.d.f. of the workload in consideration:

\[ l(x) = \Pr[X = x], \tag{5.4} \]

and f(x) is the joint p.d.f. of the system state (failure state or non-failure state) and the workload:

\[ f(x) = \Pr[\text{failure occurs} \;\&\; X = x]. \tag{5.5} \]

A constant hazard rate implies that failures are occurring randomly with respect to the workload. An increasing
hazard rate with increasing X implies that there is an increasing failure rate with increasing workload.

Figure 5.7. Workload Hazard Plots for the IBM 3081 System





The load hazard model was applied to the software failure and workload data collected from an IBM 3081
system running the VM operating system. Based on the collected data, l(x), f(x), g(x), and z(x) were computed for
each workload variable. Figure 5.7 shows the z(x) plots for three selected workload variables:

(1) OVERHEAD — fraction of CPU time spent on the operating system; 

(2) PAGEIN — number of page reads per second by all users; 

(3) SIO (Start I/O) — number of input/output operations per second. 

The regression coefficient, R^2, which is an effective measure of the goodness of fit, is also provided in the figure.

The hazard plots show that the workload parameters appear to be acting as stress factors, i.e., the failure rate
increases as the workload increases. The effect is particularly strong in the case of the interactive workload measures
OVERHEAD and SIO. The correlation coefficients of 0.95 and 0.91 show that the failures closely fit an increasing load
hazard model. The risk of a failure also increases with increased PAGEIN, although at a somewhat lower correlation
(0.82). Note that the vertical scale on these plots is logarithmic, indicating that the relationship between the load
hazard z(x) and the workload variable is exponential, i.e., the risk of a software failure increases exponentially with
increasing workload.
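
The quantities l(x), f(x), g(x), and z(x) can be estimated from binned measurements, as the following sketch illustrates. It is not the procedure of [Iyer82a]: we assume one workload sample per measurement interval together with a flag saying whether a failure occurred in that interval, we discretize the workload into equal-width bins, and we normalize g over the bins so that it can be accumulated into G. The normalization step and all names (`load_hazard`, `workload`, `failed`) are our assumptions.

```python
from typing import List, Sequence, Tuple


def load_hazard(workload: Sequence[float], failed: Sequence[bool],
                bin_width: float, n_bins: int
                ) -> Tuple[List[float], List[float], List[float], List[float]]:
    """Estimate l, f, g and z per workload bin from interval samples."""
    total = len(workload)
    in_bin = [0] * n_bins
    fail_in_bin = [0] * n_bins
    for x, is_fail in zip(workload, failed):
        k = min(int(x // bin_width), n_bins - 1)
        in_bin[k] += 1
        if is_fail:
            fail_in_bin[k] += 1

    l = [c / total for c in in_bin]        # Pr[X in bin]
    f = [c / total for c in fail_in_bin]   # Pr[failure and X in bin]
    g = [fk / lk if lk > 0 else 0.0 for fk, lk in zip(f, l)]  # Pr[failure | X in bin]

    g_sum = sum(g)
    z = []
    cumulative = 0.0  # G just below the current bin (assumed normalization of g)
    for gk in g:
        g_norm = gk / g_sum if g_sum > 0 else 0.0
        remaining = 1.0 - cumulative
        z.append(g_norm / (remaining * bin_width) if remaining > 1e-12 else float("nan"))
        cumulative += g_norm
    return l, f, g, z


# Made-up data: 10 intervals at each of three load levels, with 1, 2 and 4 failures.
workload = [0.1] * 10 + [0.5] * 10 + [0.9] * 10
failed = ([True] + [False] * 9) + ([True] * 2 + [False] * 8) + ([True] * 4 + [False] * 6)
l, f, g, z = load_hazard(workload, failed, bin_width=0.4, n_bins=3)
print("g:", [round(v, 3) for v in g])
print("z:", [round(v, 3) for v in z])
```

In this made-up example both g and z grow with the workload, which is the pattern the hazard plots are used to detect. The value in the last bin always equals 1/bin_width under this normalization and carries no information.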


5.4.2. Two-Way Dependency 

It was mentioned in Section 2.3 that the correlation coefficient can be used to quantify the linear dependence
between two variables. When errors/failures on two components are related, the correlation coefficient between the
two components is a good measure of such dependence. The question is how to obtain it from measured data.

The first step in correlation analysis is building a data matrix based on the measured data. Assume that there are
n components in the measured system and the measured period is divided into m equal intervals of Δt (e.g., 5
minutes). An m×n data matrix can then be constructed in the following way. The n columns of the matrix represent the n
components in the measured system. The m rows of the matrix represent the m time intervals. Element (i, j) of the
matrix is set to the number of errors occurring within interval i on component j. Column j can be regarded as a
sample of the random variable, X_j, which represents the state of component j in the system.

The second step is calculating correlation coefficients using Eq. (2.19) based on the data matrix. Each time, we
pick two columns (X_i and X_j) to calculate Cor(X_i, X_j). This step can be automated by using a statistical package
such as SAS. Table 5.5 lists the average correlation coefficients of the 21 pairs of machines in a VAXcluster for
different types of errors and failures [Tang93a]. Generally, the error correlation is high (0.62) and the failure correlation
is low (0.06). Disk and network errors are strongly correlated, because the processors in the system heavily use and
share the disks and the network concurrently.
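
The two steps can be sketched as follows. We assume here that Eq. (2.19) is the usual sample (Pearson) correlation coefficient; the event format (time in seconds, component index) and the function names are hypothetical. Seven machines give 7·6/2 = 21 pairs, which matches the number of pairs averaged in the text.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def build_data_matrix(events: Sequence[Tuple[float, int]], n_components: int,
                      period: float, dt: float) -> List[List[int]]:
    """Element (i, j) counts the errors on component j during interval i."""
    m = math.ceil(period / dt)
    matrix = [[0] * n_components for _ in range(m)]
    for time, component in events:
        row = min(int(time // dt), m - 1)
        matrix[row][component] += 1
    return matrix


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample correlation coefficient of two columns (0.0 if a column is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def average_pairwise_correlation(matrix: List[List[int]]) -> float:
    """Average Cor(X_i, X_j) over all pairs of columns."""
    n = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n)]
    pairs = list(combinations(range(n), 2))
    return sum(pearson(columns[i], columns[j]) for i, j in pairs) / len(pairs)


# Made-up events: two machines that always report errors in the same 5-minute intervals.
events = [(10.0, 0), (20.0, 1), (700.0, 0), (710.0, 1)]
matrix = build_data_matrix(events, n_components=2, period=1200.0, dt=300.0)
avg = average_pairwise_correlation(matrix)
print(matrix, avg)
```

In this made-up example the two columns are identical, so the correlation is 1. Real error data produce values between the extremes, and the choice of Δt affects the result because it decides which errors count as occurring together.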


Table 5.5. Average Correlation Coefficients for VAXcluster Errors

Category  Type     Average correlation
--------  -------  -------------------
Error     Memory   0.01
Error     Disk     0.78
Error     Network  0.70
Failure   All      0.06


5.4.3. Multi-Way Dependency 

If errors/failures on more than two components are related, the correlation coefficient is not enough to quantify
the dependence among these components, i.e., multi-way correlation. In such a case, the factor analysis method
introduced in Section 2.3 can be used to uncover the underlying multi-way correlation. In this subsection, the application
of factor analysis is illustrated using the processor failure data collected from a Tandem fault tolerant system [Lee91].

Similar to the correlation analysis discussed above, the first step is building an m×n data matrix based on
measurements, where n is the number of components in the system. The measured Tandem system is an 8-processor
multicomputer, i.e., n is 8. The Δt used is 30 minutes. The element (i, j) of this matrix has a value of 1 if processor j
halts during the i-th time interval; otherwise, it has a value of 0. The j-th column of the matrix represents the sample
halt history of processor j, while the i-th row of the matrix represents the state of the eight processors in the i-th time
interval. The matrix is thus called a processor halt matrix.

Table 5.6. Factor Pattern of the Tandem Processor Halts

Processor  Factor 1  Factor 2  Factor 3  Factor 4  Communality
---------  --------  --------  --------  --------  -----------
1             0.997    -0.004    -0.069     0.023         1.00
2             0.000     0.000     0.000     0.000         0.00
3             0.061     0.012     0.853    -0.133         0.75
4                --     0.999    -0.011        --           --
5                --    -0.000     0.188        --           --
6                --     0.447    -0.005        --           --
7                --    -0.002     0.862        --           --
8                --     0.762     0.090        --           --
Var.             --        --        --     0.685
Var. %           --        --        --       8.6

(-- marks a value that is not legible in the source.)

The second step is performing factor analysis by applying the SAS procedure FACTOR to the processor halt
matrix. The results are shown in Table 5.6. The numbers in the middle of the table are factor loadings, and the last
column shows the communality. The bottom two rows show the amount of variance explained by each common factor
and its percentage of the total variance.

According to [Dillon84], factor loadings greater than 0.5 are considered to be significant. However, in
reliability analysis, factor loadings lower than 0.5 can be significant. The results show that there are four common factors.
Factor 1 captures the dependence between processor 1 and processor 5 and accounts for 24.6% of the total variance.
Factor 2 captures the multi-way dependence among processors 4, 6, and 8, although the contribution of processor 6 is
small (0.447^2, i.e., 20% of its variance is explained by this factor). Factor 2 explains 22.3% of the total variance.
Factor 3 captures the dependence between processor 3 and processor 7, and contributes 19% to the total variance. Factor 4
captures a weaker dependence (with factor loadings 0.506 and 0.641) between processor 7 and processor 8,
and accounts for 8.6% of the total variance.
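
The arithmetic behind these statements can be checked from the loadings themselves. In a factor pattern, the communality of a variable is the sum of its squared loadings, and a squared loading is the share of that variable's variance explained by the factor. The sketch below uses the loadings printed for processors 1 and 3 in Table 5.6. It also assumes that the eight variables are standardized, so that the total variance equals 8; this is our assumption, although it is consistent with the 8.6% quoted for Factor 4.

```python
from typing import Sequence


def communality(loadings: Sequence[float]) -> float:
    """Sum of squared factor loadings of one variable."""
    return sum(a * a for a in loadings)


# Loadings on Factors 1 to 4 for processors 1 and 3, as printed in Table 5.6.
loadings = {
    1: (0.997, -0.004, -0.069, 0.023),
    3: (0.061, 0.012, 0.853, -0.133),
}
for processor, row in loadings.items():
    print(f"processor {processor}: communality = {communality(row):.2f}")

# Share of processor 6's variance explained by Factor 2 (loading 0.447).
share_p6 = 0.447 ** 2
print(f"processor 6, Factor 2: {100 * share_p6:.0f}% of its variance")

# Factor 4 explains a variance of 0.685 out of an assumed total of 8.
n_variables = 8
factor4_percent = 100 * 0.685 / n_variables
print(f"Factor 4: {factor4_percent:.1f}% of the total variance")
```

The computed communalities (1.00 and 0.75), the 20% share for processor 6, and the 8.6% for Factor 4 agree with the values given in the table and the text.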


5.5. Markov Reward Modeling 

Many natural and social phenomena can be modeled by Markov or semi-Markov stochastic processes
[Trivedi82]. In the computer area, the Markov process is one of the most frequently used models in performance and
dependability evaluation. Compared to combinatorial models, Markov models have several advantages, such as the ability to
handle time-dependent failure rates, performance degradation, and interactions among components. In the area of
analytical modeling of computer systems, performability models [MeyerJ80], [MeyerJ92], availability models [Goyal87],
and Markov reward models [Reibman89], [Trivedi92] have all been addressed during the past 15 years. However, how
to apply these techniques to measured data is still not clear. Assumptions made in building analytical models also
need to be validated by measurement-based analysis.

In analytical analysis, Markov models are built based on some assumptions (such as independent failures on
different components) using individual component parameters (such as failure and recovery rates). The evaluated results
are highly dependent on input parameters and model assumptions. In measurement-based modeling, Markov models
are identified from data and therefore called measured models [Tang93b]. No additional assumptions (beyond the
Markov property) are made in the construction of models. The measured models provide the best evaluation for real
systems as well as insight into the development of representative analytical models. Thus, it is valuable to identify
appropriate models from measured data. Measurement-based Markov reward modeling techniques are illustrated
through a system model generated for a VAXcluster and a software model generated for an IBM operating system.

5.5.1. Modeling of a Distributed System 

The data used for the modeling was collected from a DEC VAXcluster system, consisting of seven machines,
for 250 days [Tang93a]. For this system, an error was defined as an abnormality in any component of the system. If
an error led to a termination of service on a machine, it was defined as a failure. A failure was identified by a reboot
following one or multiple error reports.

A. Model Construction 

Since the measured VAXcluster has seven machines, an 8-state Markov error model is constructed. The eight states, $E_0$, ..., and $E_7$, are defined such that $E_i$ represents the state wherein i machines observe errors at the same time (the time granularity is chosen to be 1 second). For example, state $E_0$ represents that none of the machines experiences errors, i.e., the VAXcluster is in the normal (error-free) state; state $E_7$ represents that all the machines experience errors. At any measured time, the VAXcluster is in one of these states.

The transition probabilities for the 8-state model are estimated from the error event data. Given that the system is in state i, the probability that it will go to state j, $p_{ij}$, is calculated as follows:


$$p_{ij} = \frac{\text{observed number of transitions from } E_i \text{ to } E_j}{\text{observed number of transitions out of } E_i} \qquad (5.6)$$
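The ratio of counts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the short state sequence are hypothetical (not taken from the VAXcluster data), and a transition is assumed to be counted only when the state actually changes between consecutive samples.

```python
from collections import Counter

def estimate_transition_probabilities(states, n_states):
    """Estimate p_ij = (# transitions i -> j) / (# transitions out of i)."""
    counts = Counter()
    for src, dst in zip(states, states[1:]):
        if src != dst:  # assumption: only a change of state is a transition
            counts[(src, dst)] += 1

    # Total number of observed transitions leaving each state.
    out_totals = Counter()
    for (src, _dst), c in counts.items():
        out_totals[src] += c

    p = [[0.0] * n_states for _ in range(n_states)]
    for (src, dst), c in counts.items():
        p[src][dst] = c / out_totals[src]
    return p

# Hypothetical sampled state sequence (value i means i machines in error).
sequence = [0, 0, 1, 0, 1, 2, 1, 0, 0, 1, 0]
P = estimate_transition_probabilities(sequence, n_states=3)
```

Because self-transitions are not counted, the diagonal of the estimated matrix is zero, which matches the layout of Table 5.7.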

Table 5.7 shows the transition probabilities calculated from the VAXcluster error data. Based on the table, an error propagation model can be obtained by calculating the probability that the system goes from state $E_i$ (i = 1, ..., 6) to any of the lower states ($E_{i-1}$, ..., $E_0$) and the probability that it goes from $E_i$ to any of the higher states ($E_{i+1}$, ..., $E_7$). These probabilities are easily determined by summing all the row elements to the left of element (i, i) and all the row elements to the right of element (i, i) in the table. The error propagation model is shown in Figure 5.8. An interesting error propagation characteristic is uncovered with this model. Notice that the transition probabilities to higher states (numbers in the upper line) tend to increase as the state increases. That is, once an error domain encompasses more than one machine, the probability of the domain involving more machines increases. In such situations, error containment can become increasingly difficult.
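The row sums described above can be reproduced directly from Table 5.7. The following Python sketch is illustrative only; the matrix values are copied from the table, while the function name `propagation` is an assumption introduced here.

```python
# Rows of Table 5.7: P[i][j] is the transition probability from E_i to E_j.
P = [
    [.000, .891, .084, .014, .004, .002, .002, .003],
    [.824, .000, .145, .023, .004, .003, .001, .000],
    [.239, .594, .000, .118, .035, .009, .004, .001],
    [.126, .211, .401, .000, .227, .024, .009, .003],
    [.079, .147, .102, .422, .000, .205, .034, .011],
    [.058, .115, .054, .073, .367, .000, .315, .018],
    [.070, .081, .024, .016, .073, .406, .000, .331],
    [.125, .104, .000, .021, .036, .161, .552, .000],
]

def propagation(P, i):
    """Return (P[move to a lower state], P[move to a higher state]) for E_i."""
    down = sum(P[i][:i])      # row elements to the left of (i, i)
    up = sum(P[i][i + 1:])    # row elements to the right of (i, i)
    return down, up

for i in range(1, 7):
    down, up = propagation(P, i)
    print(f"E_{i}: to lower states {down:.3f}, to higher states {up:.3f}")
```

With these values the upward probability is 0.176 for $E_1$ and 0.331 for $E_6$, consistent with the increasing tendency noted in the text.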


Table 5.8 shows the mean holding time, the total holding time in the measured period, and the occupancy probability in each state for the model. It is seen from the table that $E_7$ has the longest mean holding time (2.31 minutes)


Table 5.7. Transition Probability for the VAXcluster Error Model

State   E_0    E_1    E_2    E_3    E_4    E_5    E_6    E_7
E_0     .000   .891   .084   .014   .004   .002   .002   .003
E_1     .824   .000   .145   .023   .004   .003   .001   .000
E_2     .239   .594   .000   .118   .035   .009   .004   .001
E_3     .126   .211   .401   .000   .227   .024   .009   .003
E_4     .079   .147   .102   .422   .000   .205   .034   .011
E_5     .058   .115   .054   .073   .367   .000   .315   .018
E_6     .070   .081   .024   .016   .073   .406   .000   .331
E_7     .125   .104   .000   .021   .036   .161   .552   .000


Figure 5.8. An Error Propagation Model for the VAXcluster 
Table 5.8. Holding Time (HT) & Occupancy Probability for the VAXcluster Error Model

State   Mean HT (min.)   Total HT (hr.)   Occ. Prob.
E_0     22.39            5578.89          0.9298
E_1     1.27             347.42           0.0579
E_2     0.40             29.24            0.0049
E_3     0.56             14.07            0.0023
E_4     1.07             15.13            0.0025
E_5     0.40             3.37             0.0006
E_6     0.73             4.50             0.0007
E_7     2.31             7.38             0.0012


among all error states. Clearly, when all seven machines are affected by errors, the system takes the longest time to recover. The occupancy probabilities provide evidence that errors on different machines (i.e., errors in the higher states) are related. It is found that the measured occupancy probabilities for the higher states ($E_3$ to $E_7$) are quite different from the occupancy probabilities analytically determined assuming error independence. For example, consider the occupancy probability for $E_7$. By Table 5.8, the measured occupancy probability for $E_7$ is 0.0012. Assuming that errors on different machines are independent, we can easily determine the occupancy probability for this state to be at most $0.02^7$ (about $1.3 \times 10^{-12}$), where 0.02 is the highest error occurrence probability among the seven machines. That is, the measured value is higher than the calculated value by at least eight orders of magnitude.
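The comparison against the independence assumption is simple arithmetic and can be checked as follows. The two input numbers are the ones quoted in the text; the variable names are assumptions made for this sketch.

```python
import math

p_max = 0.02          # highest per-machine error probability quoted in the text
measured_e7 = 0.0012  # measured occupancy probability of E_7 (Table 5.8)

# Under independence, the probability that all seven machines are in error
# at the same time is at most p_max ** 7.
bound_e7 = p_max ** 7

# How many orders of magnitude the measurement exceeds that bound.
ratio = measured_e7 / bound_e7
orders = math.log10(ratio)
print(f"bound = {bound_e7:.3e}, ratio = {ratio:.3e}, orders = {orders:.2f}")
```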


B. Reward Analysis 

Markov models can be used to conduct reward analysis [Trivedi92] to quantify the loss of service due to errors 
and failures. The key step is to define a reward function which characterizes the performance loss in each degraded 
state. For a multicomputer system, a generic reward function can be defined for both a single machine and the whole 
system. Given a time interval $\Delta T$ (a random variable), the reward rate for the system in $\Delta T$ is determined by

$$r(\Delta T) = W(\Delta T) / \Delta T , \qquad (5.7)$$

where $W(\Delta T)$ denotes the useful work done by the system in $\Delta T$ and is calculated by

$$W(\Delta T) = \begin{cases} \Delta T & \text{in normal state} \\ \Delta T - n\tau & \text{in error state} \\ 0 & \text{in failure state,} \end{cases} \qquad (5.8)$$

where n is the number of raw errors (error entries in the log, see Section 5.2) in $\Delta T$ and $\tau$ is the mean recovery time for a single error. Thus, one unit of reward is given for each unit of time when the system is in the normal state. In an error state, the penalty paid depends on the recovery time the system spends in that state, which is determined by the linear function $\Delta T - n\tau$ (normally, $\Delta T > n\tau$; if $\Delta T < n\tau$, $W(\Delta T)$ is set to 0). In a failure state, $W(\Delta T)$ is by definition zero.
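Eqs. (5.7) and (5.8) translate into a small function. The sketch below is an illustration, not code from the cited studies: the state labels and the names `useful_work` and `reward_rate` are assumptions, and time units only need to be consistent between `dt` and `tau`.

```python
def useful_work(state, dt, n_errors=0, tau=0.0):
    """Useful work W(dt) done in an interval of length dt, per Eq. (5.8)."""
    if state == "normal":
        return dt
    if state == "error":
        # Linear penalty n * tau, clamped at zero when dt < n * tau.
        return max(dt - n_errors * tau, 0.0)
    if state == "failure":
        return 0.0
    raise ValueError(f"unknown state: {state!r}")

def reward_rate(state, dt, n_errors=0, tau=0.0):
    """Reward rate r(dt) = W(dt) / dt, per Eq. (5.7)."""
    return useful_work(state, dt, n_errors, tau) / dt

# Example: a 10-second error interval with 2 raw errors and tau = 1 second.
example_rate = reward_rate("error", 10.0, n_errors=2, tau=1.0)
```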


Applying Eq. (5.8) to the VAXcluster, the reward rate formula has the following form:

$$r(\Delta T) = \sum_{k=1}^{7} W_k(\Delta T) / (7 \times \Delta T) , \qquad (5.9)$$

where $W_k(\Delta T)$ denotes the useful work done by machine k in time $\Delta T$. Here all machines are assumed to contribute an equal amount of reward to the system. For example, if three machines fail when the system is in $E_3$, the reward rate is 4/7.


The expected steady-state reward rate, Y, can be estimated by [Trivedi92]

$$Y = \frac{1}{T} \sum_{j} r(\Delta t_j)\,\Delta t_j , \qquad (5.10)$$

where T is the summation of all $\Delta t_j$'s (particular values of $\Delta T$) in consideration. If we substitute r from Eq. (5.9) and let $\Delta T$ represent the holding time of each state in the error model, Y becomes the steady-state reward rate of the VAXcluster, which is also an estimate of system availability (performance-related availability). If we substitute r from Eq. (5.9) and let $\Delta T$ represent the time span of the error event for a particular type of error, Y becomes the steady-state reward rate of the system during the event intervals of the specified error. Thus, (1 - Y) measures the loss in performance during the specified error event. Note that it is possible that there are failed machines when the system is in an

Table 5.9. Steady-State Reward Rate for the VAXcluster

τ     0.1 ms     1 ms       10 ms      100 ms
Y     0.995078   0.995077   0.995067   0.994971


Table 5.10. Steady-State Reward Rate for Each Error Type in the VAXcluster

CPU       Memory    Disk      Tape      Network   Software
0.14950   0.99994   0.61314   0.89845   0.56841   0.00008
error state. Since the model is an empirical model based on the error event data (of which the failure event data is a subset), the information about errors and failures of all machines for each particular $\Delta t_j$ can be obtained from the data.
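A time-weighted average of per-interval reward rates, as in Eqs. (5.9) and (5.10), can be sketched as follows. The interval data are hypothetical and the function names are assumptions; each interval carries its length and the useful work done by every machine in that interval.

```python
def cluster_reward_rate(dt, machine_work):
    """Eq. (5.9): average of per-machine useful work over the interval dt."""
    return sum(machine_work) / (len(machine_work) * dt)

def steady_state_reward(intervals):
    """Eq. (5.10): reward rates weighted by interval length, divided by T."""
    total_time = sum(dt for dt, _work in intervals)
    weighted = sum(cluster_reward_rate(dt, work) * dt for dt, work in intervals)
    return weighted / total_time

# Hypothetical data for a seven-machine cluster:
#   100 time units with all machines normal, then
#   10 time units with three machines failed and four normal.
intervals = [
    (100.0, [100.0] * 7),
    (10.0, [10.0] * 4 + [0.0] * 3),
]
Y = steady_state_reward(intervals)
```

In the second interval the reward rate is 4/7, which is the example given in the text for three failed machines.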


The steady-state reward rate for the VAXcluster was computed with $\tau$ being 0.1, 1, 10, and 100 ms. The results are given in Table 5.9. The table shows that the reward rate is not sensitive to $\tau$. This is because the overall recovery time is dominated by the failure recovery time, i.e., the major contributors to the performance loss are failures, not non-failure errors. In the range of these $\tau$ values, the VAXcluster availability is estimated to be 0.995. Table 5.10 shows the steady-state reward rate for each error type ($\tau$ = 1 ms) for the VAXcluster. These numbers quantify the loss of performance due to the recovery from each type of error. For example, during the recovery from CPU errors, the system can be expected to deliver approximately 15% of its full performance. During the disk error recovery, the average system performance degrades to nearly 61% of its capacity. Since software errors have the lowest reward rate (0.00008), the loss of work during the recovery from software errors is the most significant.


5.5.2. Modeling of an Operating System 

The modeled operating system is the IBM MVS system running on an IBM 3081 mainframe [Hsueh87]. The measurement period is one year. A Markov model is developed using data collected from the system to describe error detection and recovery inside an operating system. The MVS is a widely used IBM operating system. Primary features of the system are reported to be efficient storage management and automatic software error recovery. The MVS system attempts to correct software errors using recovery routines. The philosophy in the MVS is that for major system functions, the programmer envisages possible failure scenarios and writes a recovery routine for each. It is, however, the responsibility of the installation (or the user) to write recovery routines for applications.

Recovery routines in the MVS operating system provide a means by which the operating system prevents a total loss on the occurrence of software errors. When a program is abnormally interrupted due to an error, the supervisor routine gets control. If the problem is such that further processing can degrade the system or destroy data, the supervisor routine gives control to the recovery termination manager (RTM), an operating system module responsible for error and recovery management. If a recovery routine is available for the interrupted program, the RTM gives control to this routine before it terminates the program. The purpose of a recovery routine is to free the resources kept by the failing program, to locate the error, and to request either a retry or the termination of the program.

More than one recovery routine can be specified for the same program. If the current recovery routine is unable 
to restore a valid state, RTM can give control to another recovery routine, if available. This process is called percola- 
tion. The percolation process ends if either a routine issues a valid retry request or no more recovery routines are 
available. In the latter case, the program and its related subtasks are terminated. If a valid retry is requested, a retry is 
attempted to restore a valid state using the information supplied by the recovery routine and then give control to the 
program. For a retry to be valid, there should be no risk of error recurrence and the retry address should be properly 
specified. An error recovery can result in the following four situations: 

(1) Resume op (resume operation) — The system successfully recovers from the error and returns control to the 
interrupted program. 

(2) Task term (task termination) — The program and its related subtasks are terminated, but the system does not 
fail. 

(3) Job term (job termination) — The job in control at the time of the error is aborted. 

(4) System failure — The job or task, which was terminated, is critical for system operation. As a result of the 
termination, a system failure occurs. 
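The percolation process and its outcomes can be summarized in a short control-flow sketch. This is a deliberately simplified, hypothetical model and not MVS code: recovery routines are represented as callables returning "retry" or "percolate", every retry is assumed to succeed, and the job termination outcome is omitted.

```python
def recover(routines, critical=False):
    """Walk the recovery routines of an interrupted program (simplified).

    routines: list of callables; each returns "retry" when it issues a valid
              retry request, or "percolate" when it cannot restore a valid state.
    critical: True if the program is critical for system operation.
    """
    for routine in routines:
        if routine() == "retry":
            # A valid retry returns control to the interrupted program.
            return "resume op"
        # Otherwise control percolates to the next routine, if any.
    # No more recovery routines: the program and its subtasks are terminated.
    return "system failure" if critical else "task term"

# Usage: the first routine percolates, the second requests a retry.
outcome = recover([lambda: "percolate", lambda: "retry"])
```

The sketch makes the ordering explicit: termination is reached only after every available routine has percolated.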

A. Model Construction

The states of the model consist of eight types of error states (see Table 5.11) and four states resulting from
error recoveries. Figure 5.9 shows the model. The normal state represents that the operating system is running error- 
free. The transition probabilities were estimated from the measured data using Eq. (5.6). Note that the system failure 
state is not shown in the figure. This is because the occurrence of system failure was rare, and the number of observed 
system failures was statistically insignificant. 

Table 5.11 shows the mean waiting time characteristics of the normal and error states in the model. Note that the waiting time distribution of the normal state is the TTE distribution. It has been shown in Section 5.3.3 that this distribution is not simply exponential (it is a multi-stage gamma distribution), so the model is a semi-Markov model. In the table, a multiple software error is defined as an error burst consisting of more than one type of software error. The average duration of a multiple error is at least four times longer than that of any single type of error, which is typically in the range of 20 to 40 seconds, except for DLCK (deadlock) and OTHR (others). The average recovery time from a program exception is twice as long as that from a control error (42 seconds versus 21 seconds). This is probably due to the extensive software involvement in recovering from program exceptions.

Figure 5.9. MVS Software Error/Recovery Model


An error recovery can be as simple as a retry or as complex as requiring several percolations before a successful 
retry. The problem can also be such that no retry or percolation is possible. Figure 5.9 shows that about 83.1% of all 
retries are successful. The figure also shows that the operating system is able to recover from 93.5% of I/O and data 
management errors and 78.4% of control related errors by retries. These observations indicate that most I/O and con- 
trol related errors are relatively easy to recover from, compared to the other types of errors such as deadlock or storage 
errors. Also note that "no percolation" occurs only in recovering from storage management errors. This indicates that
storage management errors are more complicated than the other types of errors. 


Table 5.11. Mean Waiting Time

State                                # Observations   Mean Waiting Time (Sec.)   Standard Deviation
Normal (Error-Free)                  2757             10461.33                   32735.04
CTRL (Control Error)                 213              21.92                      84.21
DLCK (Deadlock)                      23               4.72                       22.61
I/O (I/O & Data Management Error)    1448             25.05                      77.62
PE (Program Exception)               65               42.23                      92.98
SE (Storage or Address Exception)    149              36.82                      79.59
SM (Storage Management Error)        313              33.40                      95.01
OTHR (Other Type)                    66               1.86                       12.98
MULT (Multiple Software Error)       481              175.59                     252.79


B. Model Evaluation

The steady-state measures evaluated from the model are listed in Table 5.12. The definitions of these measures are given in [Howard71].

(1) Transition probability ($\pi_j$): the probability that the transition is to state j, given that a transition occurs

(2) Occupancy probability ($\Phi_j$): the probability that the system occupies state j at any time point

(3) Mean recurrence time ($\Theta_j$): the mean recurrence time of state j

The occupancy probability of the normal state can be viewed as the operating system availability without degradation. The state transition probability, on the other hand, characterizes error detection and recovery processes in the operating system. Table 5.12(a) lists the state transition probabilities and occupancy probabilities for the normal and error states. Table 5.12(b) lists the state transition probabilities and the mean recurrence times of the recovery and result states. A dashed line in the table indicates a negligible value (less than 0.00001). Table 5.12(a) shows that the occupancy probability of the normal state in the model is 0.995. This indicates that the operating system is running error-free 99.5% of the time. In the other 0.5% of the time the operating system is in the error or recovery states. In more than half of the error and recovery time (i.e., 0.29% out of 0.5%) the operating system is in the multiple error state. The average reward rate for all software error and recovery states is estimated from data to be 0.2736. Based on this reward rate and the occupancy probability for all error and recovery states shown in the table (0.005), the steady-state reward loss in the modeled MVS can be evaluated to be 0.005 × (1 - 0.2736) ≈ 0.00363.

By solving the model, it is found that the operating system makes a transition every 43.37 minutes. Table 5.12(a) shows that 24.74% of all transitions made in the model are to the normal state, 24.73% to error states (obtained by summing all the $\pi$'s for all error states), 25.79% to recovery states, and 24.74% to result states. Since a transition occurs every 43 minutes, it can be estimated that, on the average, a software error is detected every 3 hours and a successful recovery (i.e., reaching the "resume op" state) occurs every 5 hours. Table 5.12(b) also shows that more than 40% of software errors lead to job or task terminations, which cause the loss of service to users. However, only a few of these terminations lead to system failures. This result indicates that recovery routines in MVS are effective in avoiding system failures but are not so effective in avoiding user job terminations.
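The 3-hour and 5-hour estimates follow from dividing the mean time per transition by the fraction of transitions that enter the states of interest. The sketch below assumes this standard semi-Markov relation; the function name is introduced here for illustration, and the inputs are the values quoted in the text and Table 5.12.

```python
MEAN_TRANSITION_MIN = 43.37  # mean time between transitions (minutes)

def mean_recurrence_hours(pi):
    """Mean time (hours) between entries into states with transition share pi."""
    return MEAN_TRANSITION_MIN / pi / 60.0

error_interval = mean_recurrence_hours(0.2473)   # all error states combined
resume_interval = mean_recurrence_hours(0.1414)  # the "resume op" state
print(f"error detected every {error_interval:.2f} hr, "
      f"successful recovery every {resume_interval:.2f} hr")
```

The value obtained for the "resume op" state agrees with the mean recurrence time of 5.11 hours listed in Table 5.12(b).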


Table 5.12. Error/Recovery Model Characteristics

(a)
           Normal    Error State
Measure    State     CTRL      DLCK     I/O       PE         SE         SM        OTHR     MULT
π          0.2474    0.0191    0.0020   0.1299    0.0060     0.0134     0.0281    0.0057   0.0431
Φ          0.9950    0.00016   --       0.00125   0.000098   0.000189   0.00036   --       0.002913

(b)
           Recovery State                             Resultant State
Measure    Retry    Percolation   No-Percolation     Resume Op.   Task Term.   Job Term.
π                   0.0845                           0.1414       0.0712       0.0348
Θ (hr.)             8.55                             5.11         10.16        20.74





5.6. Software Dependability 


A great deal of research has been performed in the area of software reliability during the development phase. 
Different models have been proposed (reviewed in [Musa87]) to characterize the reliability growth of the candidate 
software through this phase. In general, these models can be divided into two classes. The first assumes that the fail- 
ure rate is a function of the number of remaining defects in the software. Imperfect debugging and uncertainty in the 
projected number of initial defects have also been modeled [Goel85]. The second class of models does not depend on 
the knowledge of the number of the remaining defects [Littlewood80]. The failure rate is assumed to be a random 
variable and the software reliability model involves two stochastic processes. Although most models perform well 
within their own contexts, their performance varies significantly from one data set to another. 

The operational phase of a mature software is much different from the development phase. In the operational 
phase, a typical situation involves frequent changes and updates installed either by system managers or by vendors. 
Often, without notification to the installation management, the vendor will install a change (patch) to fix a fault found 
at some other installation. In a sense, the system being measured represents an aggregate of all such systems being 
maintained by the vendor. In addition, software reliability in the operational phase is also attributed to workload 
effects, hardware problems, and environmental factors. Thus, software reliability in the operational phase cannot be 
characterized by simply applying analytical models proposed for the development phase. 

Studies dealing with software dependability issues for the operational phase have also evolved over the past 15 
years. Software TTE distributions (Section 5.3), dependency between software failure and workload (Section 5.4), and 
modeling of software error/recovery processes (Section 5.5) have been discussed in previous sections. In this section, 
several other issues, including error interactions (i.e., hardware-related and correlated software errors), software fault 
tolerance, and software defect classification are discussed. 

5.6.1. Error Interactions 

When software is running in a complex system, interactions between hardware and software, and interactions among multiple processors can cause software error scenarios that cannot be seen during testing. Investigation of such error scenarios is helpful for understanding characteristics of software errors in operational systems. In the following, two kinds of such error scenarios are discussed: hardware-related software errors, which are a result of interactions between hardware and software, and correlated software errors, which are a result of interactions among processors through software protocols.


A. Hardware-Related Software Errors 


In [Iyer85a], software errors related to hardware errors were described as hardware-related software errors. More precisely, if a software error (failure) occurs in close proximity (within a minute) to a hardware error, it is called a hardware-related software (HW/SW) error (failure). There are several causes of hardware-related software errors. For instance, a hardware error, such as a flipped memory bit, may change the software condition, resulting in a software error. Therefore, even though it is reported as a software error, it is actually caused by faulty hardware. Another possibility is that the software may fail to handle an unexpected hardware problem such as an abnormal condition in the network communication. This can be attributed to a software design flaw. Sometimes, both the hardware error and the software error are symptoms of another, unidentified problem.
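The proximity rule lends itself to a simple log-processing sketch. The code below is illustrative only: the function name and the timestamps are hypothetical, and "within a minute" is assumed to mean a hardware error at most 60 seconds before or after the software error.

```python
from bisect import bisect_left

def hw_related(sw_times, hw_times, window=60.0):
    """Flag each software error that has a hardware error within `window` seconds.

    sw_times, hw_times: timestamps in seconds (assumed to share a common clock).
    Returns one boolean per software error, in the order given.
    """
    hw_sorted = sorted(hw_times)
    flags = []
    for t in sw_times:
        # Index of the first hardware error at or after t - window.
        k = bisect_left(hw_sorted, t - window)
        flags.append(k < len(hw_sorted) and hw_sorted[k] <= t + window)
    return flags

# Hypothetical timestamps: the first software error is 30 s before a hardware error.
flags = hw_related(sw_times=[100.0, 500.0], hw_times=[130.0, 1000.0])
```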

Table 5.13 shows the frequency and percentage of hardware-related software errors/failures (among all software errors/failures) measured from an IBM 3081 system [Iyer85b] and two VAXclusters [Tang92b]. In the IBM system, approximately 33% of all observed software failures are hardware-related. HW/SW errors are found to have large error-handling times (high recovery overhead). The system failure probability for the HW/SW errors is close to three times that for software errors in general. The VAXcluster data shows that most hardware errors involved in HW/SW errors are network errors (75%). This indicates that the major sources of hardware-related software problems in the measured VAXclusters are network-related hardware or software components. This is a unique feature of the multicomputer system, where processes highly rely on intercommunications through the network.


Table 5.13. Hardware-Related Software Errors/Failures

             HW/SW Errors            HW/SW Failures
Category     Frequency   Percent     Frequency   Percent
IBM/MVS      177
VAX/VMS      32          18.9        28















B. Correlated Software Errors 


When multiple instances of a software system interact with each other in a multicomputer environment, the issue of correlated failures should be addressed. Several studies [Tang90], [Wein90], [Lee91] found that significant correlated processor failures exist in the measured multicomputer systems. Correlated software failures are also found in the VAX/VMS and the Tandem GUARDIAN operating systems [Lee93a]. The data showed that about 10% of software failures in the measured VAXcluster and 20% of software halts in the measured Tandem system occurred on multiple machines concurrently. To understand how correlated software failures occur, it is instructive to examine a real case in detail.

Figure 5.10 shows a scenario of correlated software failures. In the figure, Europa, Jupiter, and Mercury are machine names in the VAXcluster. A dashed line represents that the corresponding machine is in a failure state. At one time, a network error (net1) was reported from the CI (Computer Interconnect) port on Europa. This resulted in a software failure (soft1) 13 seconds later. Twenty-four seconds after the first network error (net1), additional network errors (net2, net3) were reported on the second machine (Jupiter), which was followed by a software failure (soft2). The error sequence on Jupiter was repeated (net4, net5, soft3) on the third machine (Mercury). The three machines experienced software failures concurrently for 45.5 minutes. All three software failures occurred shortly after network errors occurred, so they were network error related. Further analysis of the data revealed that the network-related

Figure 5.10. A Scenario of Correlated Software Failures

net1, net3, net5: Port will be re-started. net2, net4: Virtual circuit timeout.
software of the VAX/VMS is a potential software bottleneck in terms of correlated failures.


The higher percentage of correlated software failures in the Tandem system can be attributed to the architectural 
characteristics of the system. In the Tandem system, a single software fault can cause halts of two processors on 
which the primary and backup processes (see Section 5.6.2) of the faulty software are executing. If the two halted 
processors control a disk which includes files needed by other processors on the system, additional software halts can 
occur on these processors. (In the Tandem system, a disk can typically be accessed by two processors via dual-port 
disk controllers.) This explains why there is a higher percentage of correlated software failures in the Tandem system. 

Note that the above scenario is a multiple component failure situation not expected in general system design, 
which assumes failure independence. Even the Tandem fault tolerant system is not designed explicitly to guard 
against this situation. Generally, correlated failures can stress recovery and break the protection provided by the fault 
tolerance. 

5.6.2. Software Fault Tolerance

While hardware fault tolerance techniques have been used successfully, the issue of software fault tolerance is 
still not well addressed. Major approaches for software fault tolerance rely on design diversity [Avizienis84], [Ran- 
dell75]. But these approaches are usually inapplicable to large operating systems because of immense cost in devel- 
oping and maintaining the software. However, some fault tolerance techniques not explicitly designed for tolerating 
software faults can provide a certain amount of software fault tolerance. Understanding such techniques is important 
for designing good approaches to improving software dependability. The Tandem GUARDIAN system, running on the
single-failure tolerant multicomputer system, is a good target for such evaluations. 

The Tandem GUARDIAN operating system is a message-based distributed system built for on-line transaction processing [Bartlett78]. High availability is achieved via single-failure tolerance techniques including the process-pair approach. For each user program, there are two processes (a primary process and a backup process) executing the same program on two processors. During normal operation, the primary process performs all operations for the user, while the backup process passively watches message flows. The primary process periodically sends checkpoint messages to its backup. When the primary process detects an inconsistency in its state, it fails fast, and the backup process takes over the responsibility of the primary process. This approach can tolerate transient software errors, which will usually not be repeated by reexecuting the process.

A study of operating system fault tolerance achieved by the single-failure tolerance techniques implemented in a Tandem multiprocessor system was reported in [Lee92]. The measured system had 16 processors and was working in a high-stress environment. The data source was the processor halt log maintained by the GUARDIAN system for a period of 23 months. The effect of the built-in fault tolerance mechanisms on software availability was evaluated by reward analysis. Two reward functions were defined in the analysis. In the definition, i represents the system state in which there are i failed processors, and n represents the total number of processors in the system. The first function (SFT) reflects the fault tolerance of the Tandem system. In this function, the first processor halt does not cause any degradation. For additional processor halts, the loss of service is proportional to the number of processors halted. The second function (NSFT) assumes no fault tolerance. The difference between the two functions allows evaluation of the improvement in service due to the built-in fault tolerance mechanisms.

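SFT (Single-Failure Tolerance):

$$r_i = \begin{cases} 1 & \text{if } i = 0 \\ 1 - \dfrac{i-1}{n} & \text{if } 0 < i < n \\ 0 & \text{if } i = n \end{cases} \qquad (5.13)$$

NSFT (No Single-Failure Tolerance):

$$r_i = 1 - \frac{i}{n} , \quad 0 \le i \le n \qquad (5.14)$$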
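The two reward functions can be written down directly. The following Python sketch follows the verbal definitions in the text (first halt costs nothing under SFT, further halts cost 1/n each); the function names `sft` and `nsft` are assumptions made here.

```python
def sft(i, n):
    """Reward rate with single-failure tolerance: i failed processors out of n."""
    if i == 0:
        return 1.0
    if i < n:
        # The first halt is masked; each additional halt costs 1/n.
        return 1.0 - (i - 1) / n
    return 0.0  # all processors halted

def nsft(i, n):
    """Reward rate without fault tolerance: every halt costs 1/n."""
    return 1.0 - i / n

# For the 16-processor system in the study, compare both functions.
n = 16
rows = [(i, sft(i, n), nsft(i, n)) for i in range(n + 1)]
```

The gap between the two functions is 1/n for every state with 0 < i < n, which is the service attributed to the fault tolerance in each such state.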

Based on the above reward functions, the expected steady-state reward rate, i.e., the Y in Eq. (5.10), was evaluated for software, non-software, and all halts. The results are given in Table 5.14. The bottom row shows the improvement in service time (i.e., reduction in reward loss) due to the fault tolerance. It is seen that the single-failure tolerance in the measured system reduces the service loss due to software halts by 89% and due to non-software halts by 92%. This clearly demonstrates the effectiveness of the implemented fault tolerance mechanisms against software failures as well as non-software failures. The table also shows that software problems account for 30% of the service loss in the measured system (with SFT). Although the system was working in a high-stress environment, the overall reward loss is small (with SFT). This reflects the high availability of the measured system.
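The improvement figures are the relative reduction in reward loss (1 - Y) when moving from NSFT to SFT. A short check using the loss values of Table 5.14 (the dictionary layout and names are assumptions of this sketch):

```python
# Reward loss (1 - Y) from Table 5.14: (NSFT, SFT) for each class of halts.
loss = {
    "software": (0.00062, 0.00007),
    "non-software": (0.00205, 0.00016),
    "all": (0.00267, 0.00023),
}

# Relative reduction in service loss due to single-failure tolerance.
improvement = {kind: 1.0 - sft / nsft for kind, (nsft, sft) in loss.items()}

# Share of the SFT service loss that is caused by software halts.
software_share = loss["software"][1] / loss["all"][1]

for kind, value in improvement.items():
    print(f"{kind}: {value:.1%}")
```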
Table 5.14. Loss of Service Caused by Halts in the Tandem System

Measure              Software   Non-Software   All
NSFT   1 - Y         .00062     .00205         .00267
       Percent       23.2       76.8           100
SFT    1 - Y         .00007     .00016         .00023
       Percent       30.4       69.6           100
Improvement          89%        92%            91%


5.6.3. Software Defect Classification 


In recent studies of software defects reported from the IBM MVS operating system [Sullivan91] and two IBM large database management systems, DB2 and IMS [Sullivan92], a software defect classification scheme was proposed. The scheme uses three concepts (error type, defect type, and error trigger) to classify software faults and errors. The error type classifies the low-level programming mistakes that lead to software failures. The defect type is a higher-level classification that distinguishes design mistakes, coding mistakes, and administrative mistakes. The error trigger is related to the running environment; it distinguishes several ways that defective code which was not executed during testing could be executed at the customer site. Tables 5.15 to 5.17 list major categories generated from the data under the three criteria.

Table 5.15. Major Categories of Error Types

Error Type               Description
Pointer Management       A variable containing the address of data is corrupted
Statement Logic          Statements are executed in the wrong order or are omitted
Synchronization          An error occurs in locking or synchronization
Uninitialized Variable   A variable is used before it is initialized

Table 5.16. Major Categories of Defect Types

Defect Type              Description
Function                 A program's functionality is missing, incomplete, or incorrect
Data Struct/Algorithm    A data structure or algorithm has a design flaw
Assignment/Checking      A coding mistake involves variable assignment or validation
Interface                Errors are discovered in the interaction between components
Timing/Synchronization   Errors occur in the management of shared or real-time resources
Build/Package/Merge      Errors occur in version control or roll-up of fixes

Table 5.17. Major Categories of Error Triggers

Error Trigger            Description
Workload                 Unusual workload conditions such as a user request with unexpected ...
Bug Fixes                A bug introduced when an earlier bug was fixed
Client Code              Errors caused by propagation from application code
Recovery/Exception       Problems in error recovery and exception handling
Timing                   Errors caused by an unanticipated sequence of events


The studies compared the error type, defect type, and error trigger distributions of the three products (DB2,
IMS, and MVS) and found that they differ. However, they have some common characteristics, such as the mode
"undefined state." The studies also investigated the impact of software defects on system availability for the
MVS operating system. A comparison between overlay defects (defects that corrupt a program's memory) and
non-overlay defects demonstrated that the impact of an overlay defect is much higher. Boundary conditions and
allocation management were found to be the major causes of overlay defects.


5.7. Failure Prediction 


Fault diagnosis and failure prediction are of significance for maintaining highly reliable systems.
Measurement-based approaches to these tasks rely on recorded error information. Several heuristic and statistical
approaches have been proposed. The heuristic approach extracts characteristics of anomalous events, such as
error reports [Lin90] or performance anomalies [Maxion90a], and relates them to faults by means of signatures.
The statistical approach uses statistical techniques to quantify relationships among system error states defined
on the basis of error types and recognizes failure patterns using the quantified relationships [Iyer90]. In the
following, we discuss two typical approaches: 1) failure prediction based on the heuristic trend analysis of
error logs and 2) failure prediction based on the statistical analysis of error symptoms.

5.7.1. Prediction Based on Heuristic Trend Analysis 

This approach is based on the observation that a system usually experiences a period of intermittent errors
before a hard failure occurs. The symptoms of intermittent errors can be used to predict impending failures. The
early study of this approach showed qualitatively that the frequency of error tuples was correlated with system
failures, based on measurements from a DEC disk subsystem [Tsao83]. Later, a heuristic trend analysis method,
the dispersion frame technique (DFT), was developed [Lin90]. DFT determines the relationship among errors by
examining their closeness in time and space.

Two concepts are used in DFT: dispersion frame (DF) and error dispersion index (EDI). A DF is defined as the
interval between two successive errors of the same type. The EDI is defined as the number of error occurrences
following the previous DF during the interval of one half of the previous DF or the DF before the previous DF.
Each DF is applied to the following two errors. A high EDI implies that the errors following the DF used to
measure the EDI are highly correlated. DFT consists of five heuristic rules developed from field experience:

(1) 3.3 rule: The two consecutive EDIs obtained by applying the same frame are at least 3.

(2) 2.2 rule: The two consecutive EDIs obtained by applying two successive frames are at least 2.

(3) 2 in 1 rule: A frame is less than 1 hour.

(4) 4 in 1 rule: Four errors occur within a 24-hour frame.

(5) 4 decreasing rule: There are four monotonically decreasing frames, and at least one frame is half the size
of its previous frame.
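
The three rules that depend only on frame sizes can be sketched in code. The sketch below is illustrative and is not
the implementation from [Lin90]: it assumes error times are given in hours, sorted, and belong to one error type on
one device; it reads the 4 in 1 rule as "four errors within a 24-hour span" and "half the size" as "at most half the
size." The 3.3 and 2.2 rules are omitted because they require EDI bookkeeping that is not modeled here. All function
names are hypothetical.

```python
from typing import List


def dispersion_frames(error_times: List[float]) -> List[float]:
    """Return the DFs: intervals between successive errors of one type."""
    return [later - earlier for earlier, later in zip(error_times, error_times[1:])]


def two_in_one(frames: List[float]) -> bool:
    """2 in 1 rule: some frame is shorter than 1 hour."""
    return any(frame < 1.0 for frame in frames)


def four_in_one(error_times: List[float]) -> bool:
    """4 in 1 rule (assumed reading): four errors fall within a 24-hour span."""
    return any(
        error_times[i + 3] - error_times[i] <= 24.0
        for i in range(len(error_times) - 3)
    )


def four_decreasing(frames: List[float]) -> bool:
    """4 decreasing rule: four monotonically decreasing frames, at least one
    of which is at most half the size of its predecessor (assumed reading)."""
    for i in range(len(frames) - 3):
        window = frames[i:i + 4]
        pairs = list(zip(window, window[1:]))
        decreasing = all(b < a for a, b in pairs)
        halved = any(b <= a / 2 for a, b in pairs)
        if decreasing and halved:
            return True
    return False


# Hypothetical error times (hours) for five errors on one device.
times = [0, 100, 160, 190, 200]
frames = dispersion_frames(times)
print(frames, two_in_one(frames), four_in_one(times), four_decreasing(frames))
```

In this hypothetical trace the frames shrink from 100 to 10 hours, so only the 4 decreasing rule fires.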

Figure 5.11 demonstrates an example of DFT, including some activated heuristics. In the figure, the top line
represents the time sequence of five error occurrences (1, ..., 5) in a particular device. DFT is activated when
a frame size less than 168 hours (1 week) is encountered. Assume that all the frames in the figure fall into this
threshold. Each frame is applied to the following two errors, at the time points of the two error occurrences.
For example, DF(1,2) is applied to errors 2 and 3, DF(2,3) is applied to errors 3 and 4, etc. An upward arrow
represents a failure warning issued under the above heuristic rules.

Figure 5.11. Dispersion Techniques

[Figure: time line of error occurrences 1 to 5 with frames DF(1,2), DF(2,3), DF(3,4), and DF(4,5); warnings are
marked for the 3.3, 2.2, and 4 decreasing rules.]

DFT was applied to the data collected from 13 public domain file servers at Carnegie Mellon University over a
22-month period. Among 16 hard failures examined, DFT predicted 15, with 5 false alarms. That is, the successful
prediction rate is 93.7%. This result shows that DFT is very effective when coupled with good system
instrumentation. The disadvantage of this approach is that different systems may require different heuristics and
parameters.

5.7.2. Prediction Based on Statistical Analysis 

The objective of this approach is to recognize intermittent failures through statistical analysis and testing on
recorded error data. The approach starts by identifying key error patterns potentially symptomatic of failure
occurrences and then refines these patterns by scanning the rest of the data in stages for similar error
patterns. At each stage, the similarity is statistically tested. The approach is illustrated by the flowchart in
Figure 5.12.

Figure 5.12. Automatic Recognition of Persistent Failures

In the first stage, data coalescing is performed on the raw data to eliminate redundant reports. The output of
this stage is error records (tuples) characterized by error states (error type, machine condition, etc.). Next,
all error records occurring within a small time interval (15 minutes) are identified as error groups. These
groups represent periods of high error activity (error bursts). Experience has shown that when system errors
occur in bursts of a relatively high error rate, the errors are often related. In the second stage, statistical
analysis and hypothesis testing are performed on each error group to determine whether a valid correlation exists
among its members (error records). Randomly formed groups in which members are statistically independent are
rejected. Thus, the original error groups consisting of records among which relationships can exist are refined
to the validated error groups consisting of records among which relationships do exist.
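
The formation of error groups in the first stage can be sketched as follows. This is a minimal illustration, assuming
that "within a small time interval" means that the gap between consecutive records does not exceed 15 minutes; the
`ErrorRecord` type, its fields, and the function name are hypothetical, and the statistical validation performed in
the second stage is not modeled.

```python
from typing import List, NamedTuple


class ErrorRecord(NamedTuple):
    time_min: float  # time of the record in minutes (hypothetical field)
    state: str       # error state label (hypothetical field)


def form_error_groups(records: List[ErrorRecord],
                      window_min: float = 15.0) -> List[List[ErrorRecord]]:
    """Cluster records whose gap to the previous record is within the window."""
    groups: List[List[ErrorRecord]] = []
    for record in sorted(records, key=lambda r: r.time_min):
        if groups and record.time_min - groups[-1][-1].time_min <= window_min:
            groups[-1].append(record)   # continue the current burst
        else:
            groups.append([record])     # start a new group
    return groups


# Hypothetical coalesced error records.
records = [
    ErrorRecord(0, "A"), ErrorRecord(10, "B"),
    ErrorRecord(40, "A"), ErrorRecord(50, "C"),
    ErrorRecord(200, "B"),
]
groups = form_error_groups(records)
print([[r.state for r in g] for g in groups])
```

A fixed-window partition of the time axis would be an equally plausible reading of the text; the gap-based rule is
used here only because it keeps a burst together regardless of where it starts.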
Relationships can also exist across error groups; i.e., a single cause can give rise to a persistent error and
thus foster related error groups. To examine such relationships, events and symptoms are introduced for the
analysis. An event is formed from related error groups, and its symptoms are derived from the error states that
are common to at least half of the groups in the event. A symptom set is defined as the collection of all
symptoms of an event. Figure 5.13 shows the derivation of an event's symptom set: two symptoms, S1 and S2, are
extracted from the error states of the groups in the event, and S1 and S2 constitute the symptom set for this
event.

Figure 5.13. Derivation of an Event's Symptom Set

Figure 5.14. Construction of Super Events

[Figure: events are grouped into SUPER-EVENT 1 and SUPER-EVENT 2.]

Events are then examined for correlation. Two events are grouped into a super event if they satisfy any one of
several grouping rules. Figure 5.14 illustrates how super events are constructed. There is no time restriction
when these rules are applied to the event data. When a super event is created, a corresponding super symptom set
is also created. The super symptom set starts
with just the symptoms of the first event of that super event. As another event is added, set intersection is
performed between its symptom set and each of the symptom sets already in the super event. All intersections are
then added to the super event set.
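
The construction of a super symptom set can be sketched as below. This is a literal reading of the description above
rather than the original implementation: symptoms are modeled as plain string labels, the rules that decide whether
an event joins a super event are not modeled, and the class and attribute names are hypothetical.

```python
from typing import FrozenSet, List, Set

SymptomSet = FrozenSet[str]  # symptoms modeled as labels (assumption)


class SuperEvent:
    def __init__(self, first_event_symptoms: SymptomSet) -> None:
        # Symptom sets of the member events, in the order they were added.
        self.member_symptom_sets: List[SymptomSet] = [first_event_symptoms]
        # The super symptom set starts with the symptoms of the first event.
        self.super_symptoms: Set[str] = set(first_event_symptoms)

    def add_event(self, symptoms: SymptomSet) -> None:
        # Intersect the new symptom set with each symptom set already in the
        # super event and add every intersection to the super symptom set.
        for existing in self.member_symptom_sets:
            self.super_symptoms |= symptoms & existing
        self.member_symptom_sets.append(symptoms)


# Hypothetical symptom sets for three events.
super_event = SuperEvent(frozenset({"S1", "S2"}))
super_event.add_event(frozenset({"S2", "S3"}))
super_event.add_event(frozenset({"S3", "S4"}))
print(sorted(super_event.super_symptoms))
```

In this example, S3 enters the super symptom set only when the third event is added, because it is then shared with
the second event; S4 is shared with no other event and is left out.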

In each of the above stages, statistical analysis and hypothesis testing are performed to validate the
correlations among members in the formed groups or sets. The super events derived in the final stage can be used
by service engineers to judge potential failures. This methodology was applied to the on-line error log files
from two CYBER systems, and the results were compared to the log of failures and repairs maintained by the system
staff. In nearly 85% of the cases, the engineers were directly able to confirm that the validated super events
corresponded to real system problems. The evaluation was made both on the basis of their experience and from
their field maintenance logs. For the remaining 15% of the cases, the engineers agreed that a problem had
existed, but that its manifestation was not severe enough to be noticed by their analysis.
6. CONCLUSION

In this paper, we discussed methodologies and advances in the area of the experimental analysis of computer 
system dependability. The discussion covered three fields: simulated fault injection, physical fault injection, and mea- 
surement-based analysis of operational systems. The approaches used in the three fields are suited, respectively, to the 
dependability evaluation in the three phases of a system’s life: design phase, prototype phase, and operational phase. 
Before discussing these fields, we introduced several statistical techniques used in all fields. For each field, we pro- 
posed a classification of research approaches or topics. Then we presented detailed methodologies and representative 
studies for each of these approaches or topics. 

The statistical techniques introduced included the estimation of parameters and confidence intervals, probability
distribution characterization, several multivariate analysis methods, and importance sampling. For simulated
fault injections, we covered electrical-level, logic-level, and function-level simulation approaches as well as
representative simulation environments, such as FOCUS and DEPEND. For physical fault injections, we discussed
hardware, software, and radiation fault injection methods as well as several software and hybrid tools, including
FIAT, FERRARI, HYBRID, and FINE. For measurement-based analysis of operational systems, after an introduction to
measurement and data processing techniques, we presented methods used and representative studies in basic error
characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The
discussion covered several important issues previously studied, including workload/failure dependency, correlated
failures, and software fault tolerance.


Fault injection simulations can be used to investigate the effectiveness of key design features of fault tolerant
systems and provide timely feedback to system designers. Generally, most dependability measures (except input
parameters such as failure and recovery rates) can be obtained from simulations. However, simulations need
accurate input parameters and the validation of output results, which come from physical fault injections and
measurement-based analysis. Fault injection on real systems can produce information about error latency, error
detection, error propagation, error recovery, and system reconfiguration, but it can only study artificial faults
and cannot produce some dependability measures, such as MTBF and availability. Measurement-based analysis of
operational systems under real workloads can provide valuable information on actual failure characteristics and
insight into analytical models. This type of analysis provides a means to study naturally occurring errors and
all measurable dependability metrics, such as failure and recovery rates, reliability, and availability. However,
the analysis is limited to detected errors. Further, conditions in the field can vary widely from one system to
another, casting doubt on the statistical validity of the results. Thus, all three approaches are complementary
and essential for accurate dependability analysis.

Significant progress has been made in all three fields over the past 15 years, especially in the most recent 5
years, during which several dependability analysis tools have been developed. Increasing attention is being paid
to 1) combining analytical modeling and experimental analysis and 2) combining system design and evaluation. In
the first aspect, state-of-the-art analytical modeling techniques are being applied to real systems to evaluate
various dependability and performance characteristics. Results from experimental analysis are being used to
validate analytical models and to reveal practical issues that analytical modeling must address to develop more
representative models. In the second aspect, dependability analysis tools are being combined with each other and
with other CAD tools to provide an automatic design environment which incorporates multiple levels of joint
evaluation of functionality, performance, dependability, and cost. Software failure data from testing and
operational phases are also providing feedback to the software design, improving software reliability. Further
interesting studies and advances in this area can be expected in the near future.